Understanding and Taming SSD Read Performance Variability: HDFS Case Study

Total Page:16

File Type:pdf, Size:1020Kb

Understanding and Taming SSD Read Performance Variability: HDFS Case Study Understanding and taming SSD read performance variability: HDFS case study Mar´ıa F. Borge Florin Dinu Willy Zwaenepoel University of Sydney University of Sydney EPFL and University of Sydney Abstract found in hardware. On the other hand, they can introduce their own performance bottlenecks and sources of variabil- In this paper we analyze the influence that lower layers (file ity [6, 21, 26]. system, OS, SSD) have on HDFS’ ability to extract max- In this paper we perform a deep dive into the performance imum performance from SSDs on the read path. We un- of a critical layer in today’s big data storage stacks, namely cover and analyze three surprising performance slowdowns the Hadoop Distributed File System (HDFS [25]). We par- induced by lower layers that result in HDFS read through- ticularly focus on the influence that lower layers have on put loss. First, intrinsic slowdown affects reads from every HDFS’ ability to extract the maximum throughput that the new file system extent for a variable amount of time. Sec- underlying storage is capable of providing. We focus on the ond, temporal slowdown appears temporarily and periodi- HDFS read path and on SSD as a storage media. We use cally and is workload-agnostic. Third, in permanent slow- ext4 as the file system due to its popularity. The HDFS read down, some files can individually and permanently become path is important because it can easily become a performance slower after a period of time. bottleneck for an application’s input stage which accesses We analyze the impact of these slowdowns on HDFS and slower storage media compared to later stages which are of- show significant throughput loss. Individually, each of the ten optimized to work fully in memory [31]. slowdowns can cause a read throughput loss of 10-15%. Central to our exploration is the storage access pattern of However, their effect is cumulative. When all slowdowns a single HDFS read request: single threaded, sequential ac- happen concurrently, read throughput drops by as much as cess to large files (hundreds of megabytes) using buffered 30%. We further analyze mitigation techniques and show IO. This access pattern is simple but tried and tested and has that two of the three slowdowns could be addressed via in- remained unchanged since the beginnings of HDFS. It is in- creased IO request parallelism in the lower layers. Unfortu- creasingly important to understand whether a single HDFS nately, HDFS cannot automatically adapt to use such addi- read using this access pattern can consistently extract by it- tional parallelism. Our results point to a need for adaptabil- self the maximum throughput that the underlying storage can ity in storage stacks. The reason is that an access pattern that provide. As many big data processing systems are heavily maximizes performance in the common case is not necessar- IO provisioned and the ratio of cores to disks reaches 1 [1], ily the same one that can mask performance fluctuations. arXiv:1903.09347v1 [cs.OS] 22 Mar 2019 relying on task-level parallelism to generate enough parallel requests to saturate storage is no longer sufficient. Moreover, 1 Introduction relying on such parallelism is detrimental to application per- formance since each of the concurrent HDFS reads is served Layering is a popular way to design big data storage slower than it would be in isolation. stacks [24]. Distributed file systems (HDFS [25], GFS [9]) With proper parameter configuration, the HDFS read ac- are usually layered on top of a local file system (ext4 [19], cess pattern is sufficient to extract maximum performance Btrfs [23], F2FS [17], XFS [28], ZFS [2]) running in an OS. out of our SSDs. However, we uncover and analyze three This approach is desirable because it encourages rapid devel- surprising performance slowdowns that can affect the HDFS opment by reusing existing functionality. An important goal read path at different timescales (short, medium and long) for such storage stacks is to extract maximum performance and result in throughput loss. All three slowdowns are from the underlying storage. With respect to this goal, lay- caused by lower layers (file system, SSDs). Furthermore, ering is a double-edge sword [22]. On one hand, the OS and our analysis shows that the three slowdowns can affect not the file system can compensate for and mask inefficiencies only HDFS but any application that uses buffered reads. As opposed to related work that shows worn out SSDs could • We analyze mitigation techniques and show that two of cause various performance problems [3, 10], our slowdowns the three slowdowns can be addressed via increased par- occur on SSD with very low usage throughout their lifetime. allelism. The first slowdown, which we call intrinsic slowdown, af- The rest of the paper continues as follows. x2 presents fects HDFS reads at short time scales (seconds). HDFS read HDFS architecture and the configuration that we use. x3 throughput drops at the start of every new ext4 extent read. presents our experimental setup and the metrics used. x4, A variable number of IO requests from the start of each ex- x5 and x6 introduce and analyze intrinsic, temporal and per- tent are served with increased latency regardless of the extent manent slowdown respectively. x7 discusses the findings, x8 size. The time necessary for throughput to recover is also presents related work and x9 concludes. variable as it depends on the number of request affected. The second slowdown, which we call temporal slowdown, affects HDFS reads at medium time scales (tens of minutes 2 Background to hours). Tail latencies inside the drive increase periodi- cally and temporarily and cause HDFS read throughput loss. 2.1 HDFS Architecture While this slowdown may be confused with write-triggered HDFS is a user-level distributed file system that runs layered SSD garbage collection [8], we find, surprisingly, that it ap- on top of a local file system (e.g. ext4, zfs, btrfs). An HDFS pears in a workload-agnostic manner. file is composed of several blocks, and each block is stored The third slowdown, which we call permanent slowdown, as one separate file in the local file system. Blocks are large affects HDFS reads at long time scales (days to weeks). Af- files; a common size is 256 MB. ter a period of time, HDFS read throughput from a file per- The HDFS design follows a server-client architecture. The manently drops and never recovers for that specific file. Im- server, called the NameNode, is a logically centralized meta- portantly this is not caused by a single drive-wide malfunc- data manager that also decides file placement and performs tion event but rather it is an issue that affects files individu- load balancing. The clients, called DataNodes, work at the ally and at different points in time. The throughput loss is level of HDFS blocks and provide block data to requests caused by latency increases inside the drive, but, compared coming from compute tasks. to temporal slowdown all requests are affected, not just the To perform an HDFS read request, a compute task (e.g. tail. Hadoop mapper) first contacts the NameNode to find all the Interestingly, we find that two of the three slowdowns can DataNodes holding a copy of the desired data block. The be completely masked by increased parallelism in the lower task then chooses a DataNode and asks for the entire block layers, yet HDFS cannot trigger this increased parallelism of data. The DataNode reads data from the drive and sends and performance suffers. Our results point to a need for it to the task. The size DataNode’s reads is controlled by adaptability in storage stacks. An access pattern that maxi- the parameter io.file.buffer.size (default 64KB). The DataN- mizes performance in the common case is not necessarily the ode sends as much data as allowed by the OS socket buffer one that can mask unavoidable hardware performance fluc- sizes. A DataNode normally uses the sendfile system call tuations. to read data from a drive and send it to the task. This ap- With experiments on 3 SSD based systems, we show that proach is general and handles both local tasks (i.e. on the each of the slowdowns we identified can individually intro- same node as the DataNode) as well as remote tasks. As duce at least 10-15% HDFS read throughput degradation. an optimization, a task can bypass the DataNode for local The effect is cumulative. In the worst case, all slowdowns reads (short-circuit reads) and read data directly using stan- can overlap leading to a 30% throughput loss. dard read system calls. The contributions of the paper are as follows: • We identify and analyze the intrinsic slowdown affect- 2.2 HDFS Access Pattern ing reads at the start of every new ext4 extent for a vari- We now summarize the main characteristics of the HDFS able amount of time. read access pattern since this pattern is central to our work. • We identify and analyze the temporal slowdown that • Single-threaded. One request from a compute task is affects reads temporarily and periodically while being handled by a single worker thread in the DataNode. workload-agnostic. • Large files. HDFS blocks are large files (e.g. 128MB, • We identify and analyze the permanent slowdown af- 256MB). Each HDFS block is a separate file in the local fecting reads from an individual file over time. file system. • We analyze the impact of these slowdowns on HDFS • Sequential access. The HDFS reads access data se- performance and show significant throughput loss. quentially for performance reasons. 2 • Buffered IO. HDFS uses buffered IO in the DataNode 3.1 Experimental Setup (via sendfile or read system calls).
Recommended publications
  • The Google File System (GFS)
    The Google File System (GFS), as described by Ghemwat, Gobioff, and Leung in 2003, provided the architecture for scalable, fault-tolerant data management within the context of Google. These architectural choices, however, resulted in sub-optimal performance in regard to data trustworthiness (security) and simplicity. Additionally, the application- specific nature of GFS limits the general scalability of the system outside of the specific design considerations within the context of Google. SYSTEM DESIGN: The authors enumerate their specific considerations as: (1) commodity components with high expectation of failure, (2) a system optimized to handle relatively large files, particularly multi-GB files, (3) most writes to the system are concurrent append operations, rather than internally modifying the extant files, and (4) a high rate of sustained bandwidth (Section 1, 2.1). Utilizing these considerations, this paper analyzes the success of GFS’s design in achieving a fault-tolerant, scalable system while also considering the faults of the system with regards to data-trustworthiness and simplicity. Data-trustworthiness: In designing the GFS system, the authors made the conscious decision not prioritize data trustworthiness by allowing applications to access ‘stale’, or not up-to-date, data. In particular, although the system does inform applications of the chunk-server version number, the designers of GFS encouraged user applications to cache chunkserver information, allowing stale data accesses (Section 2.7.2, 4.5). Although possibly troubling
    [Show full text]
  • A Review on GOOGLE File System
    International Journal of Computer Science Trends and Technology (IJCST) – Volume 4 Issue 4, Jul - Aug 2016 RESEARCH ARTICLE OPEN ACCESS A Review on Google File System Richa Pandey [1], S.P Sah [2] Department of Computer Science Graphic Era Hill University Uttarakhand – India ABSTRACT Google is an American multinational technology company specializing in Internet-related services and products. A Google file system help in managing the large amount of data which is spread in various databases. A good Google file system is that which have the capability to handle the fault, the replication of data, make the data efficient, memory storage. The large data which is big data must be in the form that it can be managed easily. Many applications like Gmail, Facebook etc. have the file systems which organize the data in relevant format. In conclusion the paper introduces new approaches in distributed file system like spreading file’s data across storage, single master and appends writes etc. Keywords:- GFS, NFS, AFS, HDFS I. INTRODUCTION III. KEY IDEAS The Google file system is designed in such a way that 3.1. Design and Architecture: GFS cluster consist the data which is spread in database must be saved in of single master and multiple chunk servers used by a arrange manner. multiple clients. Since files to be stored in GFS are The arrangement of data is in the way that it can large, processing and transferring such huge files can reduce the overhead load on the server, the consume a lot of bandwidth. To efficiently utilize availability must be increased, throughput should be bandwidth files are divided into large 64 MB size highly aggregated and much more services to make chunks which are identified by unique 64-bit chunk the available data more accurate and reliable.
    [Show full text]
  • A Large-Scale Study of File-System Contents
    A Large-Scale Study of File-System Contents John R. Douceur and William J. Bolosky Microsoft Research Redmond, WA 98052 {johndo, bolosky}@microsoft.com ABSTRACT sizes are fairly consistent across file systems, but file lifetimes and file-name extensions vary with the job function of the user. We We collect and analyze a snapshot of data from 10,568 file also found that file-name extension is a good predictor of file size systems of 4801 Windows personal computers in a commercial but a poor predictor of file age or lifetime, that most large files are environment. The file systems contain 140 million files totaling composed of records sized in powers of two, and that file systems 10.5 TB of data. We develop analytical approximations for are only half full on average. distributions of file size, file age, file functional lifetime, directory File-system designers require usage data to test hypotheses [8, size, and directory depth, and we compare them to previously 10], to drive simulations [6, 15, 17, 29], to validate benchmarks derived distributions. We find that file and directory sizes are [33], and to stimulate insights that inspire new features [22]. File- fairly consistent across file systems, but file lifetimes vary widely system access requirements have been quantified by a number of and are significantly affected by the job function of the user. empirical studies of dynamic trace data [e.g. 1, 3, 7, 8, 10, 14, 23, Larger files tend to be composed of blocks sized in powers of two, 24, 26]. However, the details of applications’ and users’ storage which noticeably affects their size distribution.
    [Show full text]
  • F1 Query: Declarative Querying at Scale
    F1 Query: Declarative Querying at Scale Bart Samwel John Cieslewicz Ben Handy Jason Govig Petros Venetis Chanjun Yang Keith Peters Jeff Shute Daniel Tenedorio Himani Apte Felix Weigel David Wilhite Jiacheng Yang Jun Xu Jiexing Li Zhan Yuan Craig Chasseur Qiang Zeng Ian Rae Anurag Biyani Andrew Harn Yang Xia Andrey Gubichev Amr El-Helw Orri Erling Zhepeng Yan Mohan Yang Yiqun Wei Thanh Do Colin Zheng Goetz Graefe Somayeh Sardashti Ahmed M. Aly Divy Agrawal Ashish Gupta Shiv Venkataraman Google LLC [email protected] ABSTRACT 1. INTRODUCTION F1 Query is a stand-alone, federated query processing platform The data processing and analysis use cases in large organiza- that executes SQL queries against data stored in different file- tions like Google exhibit diverse requirements in data sizes, la- based formats as well as different storage systems at Google (e.g., tency, data sources and sinks, freshness, and the need for custom Bigtable, Spanner, Google Spreadsheets, etc.). F1 Query elimi- business logic. As a result, many data processing systems focus on nates the need to maintain the traditional distinction between dif- a particular slice of this requirements space, for instance on either ferent types of data processing workloads by simultaneously sup- transactional-style queries, medium-sized OLAP queries, or huge porting: (i) OLTP-style point queries that affect only a few records; Extract-Transform-Load (ETL) pipelines. Some systems are highly (ii) low-latency OLAP querying of large amounts of data; and (iii) extensible, while others are not. Some systems function mostly as a large ETL pipelines. F1 Query has also significantly reduced the closed silo, while others can easily pull in data from other sources.
    [Show full text]
  • A Study of Cryptographic File Systems in Userspace
    Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 4507-4513 Research Article A study of cryptographic file systems in userspace a b c d e f Sahil Naphade , Ajinkya Kulkarni Yash Kulkarni , Yash Patil , Kaushik Lathiya , Sachin Pande a Department of Information Technology PICT, Pune, India [email protected] b Department of Information Technology PICT, Pune, India [email protected] c Department of Information Technology PICT, Pune, India [email protected] d Department of Information Technology PICT, Pune, India [email protected] e Veritas Technologies Pune, India, [email protected] f Department of Information Technology PICT, Pune, India [email protected] Article History: Received: 10 January 2021; Revised: 12 February 2021; Accepted: 27 March 2021; Published online: 28 April 2021 Abstract: With the advancements in technology and digitization, the data storage needs are expanding; along with the data breaches which can expose sensitive data to the world. Thus, the security of the stored data is extremely important. Conventionally, there are two methods of storage of the data, the first being hiding the data and the second being encryption of the data. However, finding out hidden data is simple, and thus, is very unreliable. The second method, which is encryption, allows for accessing the data by only the person who encrypted the data using his passkey, thus allowing for higher security. Typically, a file system is implemented in the kernel of the operating systems. However, with an increase in the complexity of the traditional file systems like ext3 and ext4, the ones that are based in the userspace of the OS are now allowing for additional features on top of them, such as encryption-decryption and compression.
    [Show full text]
  • Latency-Aware, Inline Data Deduplication for Primary Storage
    iDedup: Latency-aware, inline data deduplication for primary storage Kiran Srinivasan, Tim Bisson, Garth Goodson, Kaladhar Voruganti NetApp, Inc. fskiran, tbisson, goodson, [email protected] Abstract systems exist that deduplicate inline with client requests for latency sensitive primary workloads. All prior dedu- Deduplication technologies are increasingly being de- plication work focuses on either: i) throughput sensitive ployed to reduce cost and increase space-efficiency in archival and backup systems [8, 9, 15, 21, 26, 39, 41]; corporate data centers. However, prior research has not or ii) latency sensitive primary systems that deduplicate applied deduplication techniques inline to the request data offline during idle time [1, 11, 16]; or iii) file sys- path for latency sensitive, primary workloads. This is tems with inline deduplication, but agnostic to perfor- primarily due to the extra latency these techniques intro- mance [3, 36]. This paper introduces two novel insights duce. Inherently, deduplicating data on disk causes frag- that enable latency-aware, inline, primary deduplication. mentation that increases seeks for subsequent sequential Many primary storage workloads (e.g., email, user di- reads of the same data, thus, increasing latency. In addi- rectories, databases) are currently unable to leverage the tion, deduplicating data requires extra disk IOs to access benefits of deduplication, due to the associated latency on-disk deduplication metadata. In this paper, we pro- costs. Since offline deduplication systems impact la-
    [Show full text]
  • Cloud Computing Bible Is a Wide-Ranging and Complete Reference
    A thorough, down-to-earth look Barrie Sosinsky Cloud Computing Barrie Sosinsky is a veteran computer book writer at cloud computing specializing in network systems, databases, design, development, The chance to lower IT costs makes cloud computing a and testing. Among his 35 technical books have been Wiley’s Networking hot topic, and it’s getting hotter all the time. If you want Bible and many others on operating a terra firma take on everything you should know about systems, Web topics, storage, and the cloud, this book is it. Starting with a clear definition of application software. He has written nearly 500 articles for computer what cloud computing is, why it is, and its pros and cons, magazines and Web sites. Cloud Cloud Computing Bible is a wide-ranging and complete reference. You’ll get thoroughly up to speed on cloud platforms, infrastructure, services and applications, security, and much more. Computing • Learn what cloud computing is and what it is not • Assess the value of cloud computing, including licensing models, ROI, and more • Understand abstraction, partitioning, virtualization, capacity planning, and various programming solutions • See how to use Google®, Amazon®, and Microsoft® Web services effectively ® ™ • Explore cloud communication methods — IM, Twitter , Google Buzz , Explore the cloud with Facebook®, and others • Discover how cloud services are changing mobile phones — and vice versa this complete guide Understand all platforms and technologies www.wiley.com/compbooks Shelving Category: Use Google, Amazon, or
    [Show full text]
  • On the Performance Variation in Modern Storage Stacks
    On the Performance Variation in Modern Storage Stacks Zhen Cao1, Vasily Tarasov2, Hari Prasath Raman1, Dean Hildebrand2, and Erez Zadok1 1Stony Brook University and 2IBM Research—Almaden Appears in the proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST’17) Abstract tions on different machines have to compete for heavily shared resources, such as network switches [9]. Ensuring stable performance for storage stacks is im- In this paper we focus on characterizing and analyz- portant, especially with the growth in popularity of ing performance variations arising from benchmarking hosted services where customers expect QoS guaran- a typical modern storage stack that consists of a file tees. The same requirement arises from benchmarking system, a block layer, and storage hardware. Storage settings as well. One would expect that repeated, care- stacks have been proven to be a critical contributor to fully controlled experiments might yield nearly identi- performance variation [18, 33, 40]. Furthermore, among cal performance results—but we found otherwise. We all system components, the storage stack is the corner- therefore undertook a study to characterize the amount stone of data-intensive applications, which become in- of variability in benchmarking modern storage stacks. In creasingly more important in the big data era [8, 21]. this paper we report on the techniques used and the re- Although our main focus here is reporting and analyz- sults of this study. We conducted many experiments us- ing the variations in benchmarking processes, we believe ing several popular workloads, file systems, and storage that our observations pave the way for understanding sta- devices—and varied many parameters across the entire bility issues in production systems.
    [Show full text]
  • Filesystems HOWTO Filesystems HOWTO Table of Contents Filesystems HOWTO
    Filesystems HOWTO Filesystems HOWTO Table of Contents Filesystems HOWTO..........................................................................................................................................1 Martin Hinner < [email protected]>, http://martin.hinner.info............................................................1 1. Introduction..........................................................................................................................................1 2. Volumes...............................................................................................................................................1 3. DOS FAT 12/16/32, VFAT.................................................................................................................2 4. High Performance FileSystem (HPFS)................................................................................................2 5. New Technology FileSystem (NTFS).................................................................................................2 6. Extended filesystems (Ext, Ext2, Ext3)...............................................................................................2 7. Macintosh Hierarchical Filesystem − HFS..........................................................................................3 8. ISO 9660 − CD−ROM filesystem.......................................................................................................3 9. Other filesystems.................................................................................................................................3
    [Show full text]
  • Mapreduce: Simplified Data Processing On
    MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat [email protected], [email protected] Google, Inc. Abstract given day, etc. Most such computations are conceptu- ally straightforward. However, the input data is usually MapReduce is a programming model and an associ- large and the computations have to be distributed across ated implementation for processing and generating large hundreds or thousands of machines in order to finish in data sets. Users specify a map function that processes a a reasonable amount of time. The issues of how to par- key/value pair to generate a set of intermediate key/value allelize the computation, distribute the data, and handle pairs, and a reduce function that merges all intermediate failures conspire to obscure the original simple compu- values associated with the same intermediate key. Many tation with large amounts of complex code to deal with real world tasks are expressible in this model, as shown these issues. in the paper. As a reaction to this complexity, we designed a new Programs written in this functional style are automati- abstraction that allows us to express the simple computa- cally parallelized and executed on a large cluster of com- tions we were trying to perform but hides the messy de- modity machines. The run-time system takes care of the tails of parallelization, fault-tolerance, data distribution details of partitioning the input data, scheduling the pro- and load balancing in a library. Our abstraction is in- gram's execution across a set of machines, handling ma- spired by the map and reduce primitives present in Lisp chine failures, and managing the required inter-machine and many other functional languages.
    [Show full text]
  • CD ROM Drives and Their Handling Under RISC OS Explained
    28th April 1995 Support Group Application Note Number: 273 Issue: 0.4 **DRAFT** Author: DW CD ROM Drives and their Handling under RISC OS Explained This Application Note details the interfacing of, and protocols underlying, CD ROM drives and their handling under RISC OS. Applicable Related Hardware : Application All Acorn computers running Notes: 266: Developing CD ROM RISC OS 3.1 or greater products for Acorn machines 269: File transfer between Acorn and foreign platforms Every effort has been made to ensure that the information in this leaflet is true and correct at the time of printing. However, the products described in this leaflet are subject to continuous development and Support Group improvements and Acorn Computers Limited reserves the right to change its specifications at any time. Acorn Computers Limited Acorn Computers Limited cannot accept liability for any loss or damage arising from the use of any information or particulars in this leaflet. Acorn, the Acorn Logo, Acorn Risc PC, ECONET, AUN, Pocket Acorn House Book and ARCHIMEDES are trademarks of Acorn Computers Limited. Vision Park ARM is a trademark of Advance RISC Machines Limited. Histon, Cambridge All other trademarks acknowledged. ©1995 Acorn Computers Limited. All rights reserved. CB4 4AE Support Group Application Note No. 273, Issue 0.4 **DRAFT** 28th April 1995 Table of Contents Introduction 3 Interfacing CD ROM Drives 3 Features of CD ROM Drives 4 CD ROM Formatting Standards ("Coloured Books") 5 CD ROM Data Storage Standards 7 CDFS and CD ROM Drivers 8 Accessing Data on CD ROMs Produced for Non-Acorn Platforms 11 Troubleshooting 12 Appendix A: Useful Contacts 13 Appendix B: PhotoCD Mastering Bureaux 14 Appendix C: References 14 Support Group Application Note No.
    [Show full text]
  • Google File System 2.0: a Modern Design and Implementation
    Google File System 2.0: A Modern Design and Implementation Babatunde Micheal Okutubo Gan Tu Xi Cheng Stanford University Stanford University Stanford University [email protected] [email protected] [email protected] Abstract build a GFS-2.0 from scratch, with modern software design principles and implementation patterns, in order to survey The Google File System (GFS) presented an elegant de- ways to not only effectively achieve various system require- sign and has shown tremendous capability to scale and to ments, but also with great support for high-concurrency tolerate fault. However, there is no concrete implementa- workloads and software extensibility. tion of GFS being open sourced, even after nearly 20 years A second motivation of this project is that GFS has a of its initial development. In this project, we attempt to build single master, which served well back in 2003 when this a GFS-2.0 in C++ from scratch1 by applying modern soft- work was published. However, as data volume, network ware design discipline. A fully functional GFS has been de- capacity as well as hardware capability all have been dras- veloped to support concurrent file creation, read and paral- tically improved from that date, our hypothesis is that a lel writes. The system is capable of surviving chunk servers multi-master version of GFS is advantageous as it provides going down during both read and write operations. Various better fault tolerance (especially from the master servers’ crucial implementation details have been discussed, and a point of view), availability and throughput. Although the micro-benchmark has been conducted and presented.
    [Show full text]