Improving Linux Block I/O for Enterprise Workloads


Peter Wai Yee Wong, Badari Pulavarty, Shailabh Nagar, Janet Morgan, Jonathan Lahr, Bill Hartner, Hubertus Franke, Suparna Bhattacharya
IBM Linux Technology Center
{wpeter,pbadari,nagar,janetinc,lahr,bhartner,frankeh}@us.ibm.com, [email protected]
http://lse.sourceforge.net/

Abstract

The block I/O subsystem of the Linux kernel is one of the critical components affecting the performance of server workloads. Servers typically scale their I/O bandwidth by increasing the number of attached disks and controllers. Hence, the scalability of the block I/O layer is also an important concern.

In this paper, we examine the performance of the 2.4 Linux kernel's block I/O subsystem on enterprise workloads. We identify some of the major bottlenecks in the block layer and propose kernel modifications to alleviate these problems in the context of the 2.4 kernel. The performance impact of the proposed patches is shown using a decision-support workload, a microbenchmark, and profiling tools. We also examine the newly rewritten block layer of the 2.5 kernel to see if it addresses the performance bottlenecks discovered earlier.

1 Introduction

Over the past few years, Linux has made remarkable progress in becoming a server operating system. The release of Version 2.4 of the Linux kernel has been heralded as helping Linux break the enterprise barrier [5]. Since then, the kernel developer community has redoubled its efforts in improving the scalability of Linux on a variety of server platforms. All major server vendors such as IBM, HP, SGI, Compaq, Dell and Sun not only support Linux on their platforms, but are investing a considerable effort in improving Linux's enterprise capabilities. The Linux Technology Center (LTC) of IBM, in particular, has been a major contributor in improving Linux kernel performance and scalability. This paper highlights the efforts of the LTC in improving the performance and scalability of the block I/O subsystem of the Linux kernel.

Traditionally, the kernel block I/O subsystem has been one of the critical components affecting server workload performance. While I/O hardware development has made impressive gains in increasing disk capacity and reducing disk size, there is an increasing gap between disk latencies and processor speeds or memory access times. Disk accesses are slower than memory accesses by two orders of magnitude. Consequently, servers running I/O intensive workloads need to use large numbers of disks and controllers to provide sufficient I/O bandwidth to enterprise applications. In such environments, the kernel's block I/O layer faces a twofold challenge: it must scale well with a large number of I/O devices and it must minimize the kernel overhead for each I/O transfer.

This paper examines how the Linux kernel's block I/O subsystem handles these twin goals of scalability and performance. Using version 2.4.17 of the kernel as a baseline, we systematically identify I/O performance bottlenecks using kernel profiling tools.
We propose solutions in the form of kernel patches, all but one of which has been developed by the authors. The performance improvements resulting from these patches are presented using a decision-support workload, a disk I/O microbenchmark and profiling data. In brief, the I/O performance bottlenecks addressed are as follows:

• Avoiding the use of bounce buffers: The kernel can directly map only the first gigabyte of physical memory. I/O to high memory (beyond 1 GB) is done through buffers defined in low memory and involves an extra copy of the data being transferred. Capitalizing on the ability of PCI devices to directly address all 4GB, the block-highmem patch written by Jens Axboe can circumvent the need to use bounce buffers.

• Splitting the I/O request lock: Each I/O device in the system has an associated request queue which provides ordering and memory resources for managing I/O requests to the device. In the 2.4 kernel, all I/O request queues are protected by a single io_request_lock which can be highly contended on SMP machines with multiple disks and a heavy I/O load. We propose a solution that effectively replaces the io_request_lock with per-queue locks; an illustrative sketch of this idea appears after this list.

• Page-sized raw I/O transfers: Raw I/O, which refers to unbuffered I/O done through the /dev/raw interface, breaks I/O requests into 512-byte units (even if the device hardware and associated driver is capable of handling larger requests). The 512-byte requests end up being recombined within the request queue before being processed by the device driver. We present an alternative that permits raw I/O to be done at a page-size granularity.

• Efficient support for vector I/O: I/O intensive applications often need to perform vector (scatter/gather) raw I/O operations which transfer a contiguous region on disk to discontiguous memory regions in the application's address space. The Linux kernel currently handles vectored raw I/O by doing a succession of blocking I/O operations on each individual element of the I/O vector. We implement efficient support for vector I/O by allowing the vector elements to be processed together as far as possible.

• Lightweight kiobufs: The main data structure used in raw I/O operations is the kiobuf. As defined in 2.4.17, the kiobuf data structure is very large. When raw I/O is performed on a large number of devices, the memory consumed by kiobufs is prohibitive. We demonstrate a simple way to reduce the size of the kiobuf structure and allow more I/O devices to be used for a given amount of system memory.
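
The following is a minimal user-space model of the request-lock split described in the second item above. It is a sketch of the idea only, not the actual kernel patch: the names request_queue_model, enqueue_global and enqueue_per_queue are hypothetical, and pthread mutexes stand in for the kernel spinlocks. With the global lock, submissions to independent disks contend on the same mutex; with per-queue locks they contend only when they target the same device.

/*
 * Illustrative user-space model of the locking change, not kernel code.
 * Build with:  cc -pthread lock_model.c
 */
#include <pthread.h>
#include <stdio.h>

#define NQUEUES 4   /* pretend we have four disks, each with a queue */

struct request_queue_model {
    pthread_mutex_t lock;   /* per-queue lock (the proposed scheme) */
    long queued;            /* stand-in for the list of pending I/O */
};

/* 2.4 scheme: one lock shared by every request queue in the system. */
static pthread_mutex_t global_io_request_lock = PTHREAD_MUTEX_INITIALIZER;

static struct request_queue_model queues[NQUEUES];

/* Every submission, to any device, takes the same global lock. */
static void enqueue_global(int dev)
{
    pthread_mutex_lock(&global_io_request_lock);
    queues[dev].queued++;           /* touches only this device's queue */
    pthread_mutex_unlock(&global_io_request_lock);
}

/* Proposed scheme: each device queue is guarded by its own lock, so
 * submissions to different disks no longer contend with each other. */
static void enqueue_per_queue(int dev)
{
    pthread_mutex_lock(&queues[dev].lock);
    queues[dev].queued++;
    pthread_mutex_unlock(&queues[dev].lock);
}

int main(void)
{
    int i;

    for (i = 0; i < NQUEUES; i++)
        pthread_mutex_init(&queues[i].lock, NULL);

    enqueue_global(0);      /* contends with I/O to every other device */
    enqueue_per_queue(1);   /* contends only with other I/O to device 1 */

    for (i = 0; i < NQUEUES; i++)
        printf("queue %d: %ld pending\n", i, queues[i].queued);

    return 0;
}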

Most of the kernel performance bottlenecks listed above stem from the basic design of the 2.4 block I/O subsystem which relies on buffer heads and kiobufs. The need to maintain compatibility with a large number of device drivers has limited the scope for kernel developers to fix the subsystem as a whole. In the 2.5 development kernel, however, the challenging task of overhauling the block I/O layer has been taken up. One of the goals of the rewrite has been addressing the scalability problems of earlier designs [2]. This paper discusses the new design in light of the performance bottlenecks described earlier.

The rest of the paper is organized as follows. Section 2 presents an overview of the 2.4 kernel block I/O subsystem. The benchmark environment and workloads used are described in Section 3. Sections 4 through 8 describe the performance and resource scalability bottlenecks, proposed solutions and results. The newly written 2.5 kernel block layer is addressed in Section 9. Section 10 concludes with directions for future work.

2 Linux 2.4 Block I/O

For the purpose of this paper, our review of the 2.4 kernel block I/O subsystem will be limited in scope. Specifically, it will focus on the "raw" device interface, which was added by Stephen Tweedie during the Linux 2.3 development series.

Unix has traditionally provided a raw interface to some devices, block devices in particular, which allows data to be transferred between a user buffer and a device without copying the data through the kernel's buffer cache. This mechanism can boost performance if the data is unlikely to be used again in the short term (during a disk backup, for example), or for applications such as large database management systems that perform their own caching.

To use the raw interface, a device binding must be established via the raw command; for example, raw /dev/raw/raw1 /dev/sda1. Once bound to a block device, a raw device can be opened just like any other device.

A sampling of the kernel code path for a raw open is as follows:

sys_open
  . raw_open
    . alloc_kiovec

Notice the call to alloc_kiovec to allocate a kernel I/O buffer, also known as a kiobuf. The kiobuf is the primary I/O abstraction used by the Linux kernel to support raw I/O. The kiobuf structure describes the array of pages that make up an I/O operation. The fields of a kiobuf structure include:

// number of pages in the kiobuf
int nr_pages;

// number of bytes in the data buffer
int length;

// offset to first valid byte
// of the buffer
int offset;

// list of device block numbers
// for the I/O
ulong blocks[KIO_MAX_SECTORS];

// array of pointers to
// 1024 pre-allocated
// buffer heads
struct buffer_head *bh[KIO_MAX_SECTORS];

// array of up to 129 page
// structures, one for each
// page of data in the kiobuf
struct page **maplist[KIO_STATIC_PAGES];

The maplist array is key to the kiobuf interface, since functions that operate on pages stored in a kiobuf deal directly with page structures. This approach helps hide the complexities of the virtual memory system from device drivers – a primary goal of the kiobuf interface.

Once the raw device is opened, it can be read and written just like the block device to which it is bound. However, raw I/O to a block device must always be sector aligned, and its length must be a multiple of the sector size. The sector size for most devices is 512 bytes.

Requests are dequeued when the scheduled tq_disk task calls run_task_queue() which invokes generic_unplug_device().
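
To make the user-visible rules above concrete, the following is a minimal, hedged user-space sketch of reading from a raw device after the binding raw /dev/raw/raw1 /dev/sda1 shown earlier has been established. The device path and the 512-byte sector size are taken from the text; the use of posix_memalign reflects the practical expectation that the user buffer itself also be sector-aligned, which is an assumption of this sketch rather than something stated above. Error handling is kept to a minimum.

/*
 * Sketch: read one 512-byte sector from a raw device.
 * Assumes the binding "raw /dev/raw/raw1 /dev/sda1" is already in place
 * and the process has permission to open the raw device.
 */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR_SIZE 512

int main(void)
{
    void *buf;
    int fd;
    ssize_t n;

    /* Sector-aligned user buffer (assumed to be required in practice). */
    if (posix_memalign(&buf, SECTOR_SIZE, SECTOR_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open("/dev/raw/raw1", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/raw/raw1");
        free(buf);
        return 1;
    }

    /* The transfer starts at offset 0 and is one full sector long, so
     * both the offset and the length are multiples of the sector size,
     * as required for raw I/O. */
    n = read(fd, buf, SECTOR_SIZE);
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes from the raw device\n", n);

    close(fd);
    free(buf);
    return 0;
}

A write to the raw device would follow the same pattern and is subject to the same alignment and length constraints.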