Efficient, Scalable, and Versatile Application and System Transaction

Total Page:16

File Type:pdf, Size:1020Kb

Efficient, Scalable, and Versatile Application and System Transaction Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers A Dissertation Presented by Richard Paul Spillane to The Graduate School in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science Stony Brook University May 2012 Copyright by Richard Paul Spillane 2012 Stony Brook University The Graduate School Richard Paul Spillane We, the dissertation committee for the above candidate for the Doctor of Philosophy degree, hereby recommend acceptance of this dissertation. Erez Zadok–Dissertation Advisor Associate Professor, Computer Science Department Robert Johnson–Chairperson of Defense Assistant Professor, Computer Science Department Donald Porter–Third Inside member Assistant Professor, Computer Science Department Dr. Margo I. Seltzer–Outside member Herchel Smith Professor, Computer Science, Harvard University This dissertation is accepted by the Graduate School Charles Taber Interim Dean of the Graduate School ii Abstract of the Dissertation Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers by Richard Paul Spillane Doctor of Philosophy in Computer Science Stony Brook University 2012 A good storage system provides efficient, flexible, and expressive abstractions that allow for more concise and non-specific code to be written at the application layer. However, because I/O operations can differ dramatically in performance, a variety of storage system designs have evolved differently to handle specific kinds of workloads and provide different sets of abstractions. For ex- ample, to overcome the gap between random and sequential I/O, databases, file systems, NoSQL database engines, and transaction managers have all come to rely on different on-storage data struc- tures. Therefore, they exhibit different transaction manager designs, offer different or incompatible abstractions, and perform well only for a subset of the universe of important workloads. Researchers have worked to coordinate and unify the various and different storage system de- signs used in operating systems. They have done so by porting useful abstractions from one system to another, such as by adding transactions to file systems. They have also done so by increasing the access and efficiency of important interfaces, such as by adding a write-ordering system call. These approaches are useful and valid, but rarely result in major changes to the canonical storage stack because the need for various storage systems to have different data structures for specific workloads ultimately remains. In this thesis, we find that the discrepancy and resulting complexity between various storage systems can be reduced by reducing the difference in their underlying data structures and transac- tional designs. This thesis explores two different designs of a consolidated storage system: one that extends file systems with transactions, and one that extends a widely used database data struc- ture with better support for scaling out to multiple devices, and transactions. Both efforts result in contributions to file systems and database design. Based in part on lessons learned from these experiences, we determined that to significantly reduce system complexity and unnecessary over- head, we must create transactional data structures with support for a wider range of workloads. We have designed a generalization of the log-structured merge-tree (LSM-tree) and coupled it with two iii novel extensions aimed to improve performance. Our system can perform sequential or file sys- tem workloads 2–5× faster than existing LSM-trees because of algorithmic differences and works at 1–2× the speed of unmodified LSM-trees for random workloads because of transactional log- ging differences. Our system has comparable performance to a pass-through FUSE file system and superior performance and flexibility compared to the storage engine layers of the widely used Cas- sandra and HBase systems; moreover, like log-structured file systems and databases, it eliminates unnecessary I/O writes, writing only once when performing I/O-bound asynchronous transactional workloads. It is our thesis that by extending the LSM-tree, we can create a viable and new alternative to the traditional read-optimized file system design that efficiently performs both random database and sequential file system workloads. A more flexible storage system can decrease demand for a plethora of specialized storage designs and consequently improve overall system design simplic- ity while supporting more powerful abstractions such as efficient file and key-value storage and transactions. iv To God: for his creation, to Sandra: the bedrock of my house, and to Mom, Dad, Sean, and my family and friends: for raising, loving, and teaching me. v Table of Contents List of Figures xi List of Tables xii Acknowledgments xiv 1 Introduction 1 2 Motivation 4 Motivating Factors . .5 2.1 Important Abstractions: System Transactions and Key-value Storage . .5 2.1.1 Transactions . .6 Transactional Model . .7 2.1.2 Key-Value Storage . .9 2.2 Performance . 10 2.3 Implementation Simplicity . 10 3 Background 12 3.1 Transactional Storage Systems . 13 3.1.1 WAL and Performance Issues . 13 3.1.2 Journaling File Systems . 15 3.1.3 Database Access APIs and Alternative Storage Regimes . 17 Soft-updates and Featherstitch . 17 Integration of LFS Journal for Database Applications . 17 3.2 Explaining LSM-trees and Background . 18 3.2.1 The DAM Model and B-trees . 19 3.2.2 LSM-tree Operation and Minor/Major Compaction . 21 3.2.3 Other Log-structured Database Data Structures . 23 3.3 Trial Designs . 24 3.3.1 SchemaFS: Building an FS Using Berkeley DB . 25 3.3.2 User-Space LSMFS: Building an FS on LSMs in User-Space with Minimal Overhead . 26 3.4 Putting it Together . 27 3.5 Conclusion . 29 vi 4 Valor: Enabling System Transactions with Lightweight Kernel Extensions 30 4.1 Background . 32 4.1.1 Database on an LFS Journal and Database FSes . 32 4.1.2 Database Access APIs . 33 Berkeley DB . 33 Stasis . 34 4.2 Design and Implementation . 34 Transactional Model . 34 1. Logging Device . 37 2. Simple Write Ordering . 37 3. Extended Mandatory Locking . 37 4. Interception Mechanism . 37 Cooperating with the Kernel Page Cache . 38 An Example . 40 4.2.1 The Logging Interface . 40 In-Memory Data Structures . 41 Life Cycle of a Transaction . 41 Soft vs. Hard Deallocations . 43 On-Disk Data Structures . 43 Transition Value Logging . 43 LDST: Log Device State Transition . 44 Atomicity . 45 Performing Recovery . 45 System Crash Recovery . 45 Process Crash Recovery . 45 4.2.2 Ensuring Isolation . 45 4.2.3 Application Interception . 46 4.3 Evaluation . 47 4.3.1 Experimental Setup . 47 Comparison to Berkeley DB and Stasis . 47 Berkeley DB Expectations . 48 4.3.2 Mock ARIES Lower Bound . 50 4.3.3 Serial Overwrite . 51 4.3.4 Concurrent Writers . 52 4.3.5 Recovery . 53 4.4 Conclusions . 54 5 GTSSLv1: A Fast-inserting Multi-tier Database for Tablet Serving 57 5.1 Background . 59 5.2 TSSL Compaction Analysis . 60 HBase Analysis . 60 SAMT Analysis . 62 Comparison . 62 5.3 Design and Implementation . 63 5.3.1 SAMT Multi-Tier Extensions . 63 vii Re-Insertion Caching . 64 Space Management and Reclamation . 65 5.3.2 Committing and Stacked Caching . 66 Cache Stacking . 66 Buffer Caching . 66 5.3.3 Transactional Support . 67 Snapshot, Truncate, and Recovery . 68 5.4 Evaluation . 69 5.4.1 Experimental Setup . 69 5.4.2 Multi-Tier Storage . 70 Configuration . 70 Results . 71 5.4.3 Read-Write Trade-off . 72 Configuration . 72 Results . 73 Cassandra and HBase limiting factors . 74 5.4.4 Cross-Tree Transactions . 75 Configuration . 75 Results . 76 5.4.5 Deduplication . 77 Configuration . 77 Results . 77 Evaluation summary . 77 5.5 Related Work . 78 (1) Cluster Evaluation . 78 (2) Multi-tier storage . 78 (3) Hierarchical Storage Management . 79 (4) Multi-level caching . 79 (5) Write-optimized trees . ..
Recommended publications
  • Copy on Write Based File Systems Performance Analysis and Implementation
    Copy On Write Based File Systems Performance Analysis And Implementation Sakis Kasampalis Kongens Lyngby 2010 IMM-MSC-2010-63 Technical University of Denmark Department Of Informatics Building 321, DK-2800 Kongens Lyngby, Denmark Phone +45 45253351, Fax +45 45882673 [email protected] www.imm.dtu.dk Abstract In this work I am focusing on Copy On Write based file systems. Copy On Write is used on modern file systems for providing (1) metadata and data consistency using transactional semantics, (2) cheap and instant backups using snapshots and clones. This thesis is divided into two main parts. The first part focuses on the design and performance of Copy On Write based file systems. Recent efforts aiming at creating a Copy On Write based file system are ZFS, Btrfs, ext3cow, Hammer, and LLFS. My work focuses only on ZFS and Btrfs, since they support the most advanced features. The main goals of ZFS and Btrfs are to offer a scalable, fault tolerant, and easy to administrate file system. I evaluate the performance and scalability of ZFS and Btrfs. The evaluation includes studying their design and testing their performance and scalability against a set of recommended file system benchmarks. Most computers are already based on multi-core and multiple processor architec- tures. Because of that, the need for using concurrent programming models has increased. Transactions can be very helpful for supporting concurrent program- ming models, which ensure that system updates are consistent. Unfortunately, the majority of operating systems and file systems either do not support trans- actions at all, or they simply do not expose them to the users.
    [Show full text]
  • Membrane: Operating System Support for Restartable File Systems Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C
    Membrane: Operating System Support for Restartable File Systems Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift Computer Sciences Department, University of Wisconsin, Madison Abstract and most complex code bases in the kernel. Further, We introduce Membrane, a set of changes to the oper- file systems are still under active development, and new ating system to support restartable file systems. Mem- ones are introduced quite frequently. For example, Linux brane allows an operating system to tolerate a broad has many established file systems, including ext2 [34], class of file system failures and does so while remain- ext3 [35], reiserfs [27], and still there is great interest in ing transparent to running applications; upon failure, the next-generation file systems such as Linux ext4 and btrfs. file system restarts, its state is restored, and pending ap- Thus, file systems are large, complex, and under develop- plication requests are serviced as if no failure had oc- ment, the perfect storm for numerous bugs to arise. curred. Membrane provides transparent recovery through Because of the likely presence of flaws in their imple- a lightweight logging and checkpoint infrastructure, and mentation, it is critical to consider how to recover from includes novel techniques to improve performance and file system crashes as well. Unfortunately, we cannot di- correctness of its fault-anticipation and recovery machin- rectly apply previous work from the device-driver litera- ery. We tested Membrane with ext2, ext3, and VFAT. ture to improving file-system fault recovery. File systems, Through experimentation, we show that Membrane in- unlike device drivers, are extremely stateful, as they man- duces little performance overhead and can tolerate a wide age vast amounts of both in-memory and persistent data; range of file system crashes.
    [Show full text]
  • The New Standard for Data Archiving
    The new standard for data archiving 1 Explosively increasing 2 Why data management digital data is important Ever-increasing volumes of digital data are mounting up every day due to rapidly growing Today, files can be lost from computers in any number of ways—you might accidentally internet technology, widespread use of SNS, data transmission between network- connected delete a file, a virus might wipe one out, or there could be a complete hard drive failure. devices, and other trends. Within the video production industry, data-heavy video content When a hard drive dies an untimely death, it can feel like a house has burnt down. (for example, 4K, 8K, and 4K/8K high-frame-rate video) is becoming a major source of video Important personal items are usually gone forever—photos, significant documents, broadcasting. Many companies and research institutes are creating high volumes of data downloaded music, and more. (big data) for use in AI systems. Somehow, these newly created assets need to be managed effectively, stored safely, and utilized along with the old assets. There are many options for backing up content without any sophisticated equipment—you can 163,000 EB use DVDs, external hard drives, optical discs, or even online storage. It’s a good idea to back up data to multiple places. Computer Natural Viruses Disasters 4% Other 2% 40,026 EB 3% 2,837 EB 8,591 EB 1,227 EB Software Corruption 9% 2010 2012 2015 2020 2025 System/ Hardware 1 EB = 1,000,000 TB Malfunction 56% User Error Performance 26% Source: Ontrack Data Recovery High Performance www.ontrack.co.uk/understandingdataloss E-commerce, Financial HOT SSD RAM Capacity Optimized Back O­ce, General Service WARM HDD Sony's Optical Disc Archive storage system offers the solution, Long-Term Archive COLD with a low total cost of ownership through the use of long-life Auto Loader, O-line Tape Optical Disc Cost media, and it includes inter-generational compatibility based on the same optical disc technology used in DVDs and Blu-ray discs.
    [Show full text]
  • Restoring in All Situations
    Restoring in all situations Desktop and laptop protection 1 Adapt data restore methods to different situations Executive Summary Whether caused by human error, a cyber attack, or a physical disaster, data loss costs organizations millions of euros every year (3.5 million euros on average according to the Ponemon Institute). Backup Telecommuting Office remains the method of choice to protect access to company data and ensure their availability. But a good backup strategy must necessarily be accompanied by a good disaster recovery strategy. Nomadization The purpose of this white paper is to present the best restore practices depending on the situation you are in, in order to save valuable time! VMs / Servers Apps & DBs NAS Laptops Replication 2 Table of Contents EXECUTIVE SUMMARY .......................................................................................................................... 2 CONTEXT ................................................................................................................................................ 4 WHAT ARE THE STAKES BEHIND THE RESTORING ENDPOINT USER DATA? ................................. 5 SOLUTION: RESTORING ENDPOINT DATA USING LINA ................................................................... 6-7 1. RESTORE FOR THE MOST NOVICE USERS ................................................................................ 8-10 2. RESTORE FOR OFF SITE USERS ................................................................................................11-13 3. RESTORE FOLLOWING LOSS,
    [Show full text]
  • IBM Connect:Direct for Microsoft Windows: Documentation Fixpack 1 (V6.1.0.1)
    IBM Connect:Direct for Microsoft Windows 6.1 Documentation IBM This edition applies to Version 5 Release 3 of IBM® Connect:Direct and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright International Business Machines Corporation 1993, 2018. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Chapter 1. Release Notes.......................................................................................1 Requirements...............................................................................................................................................1 Features and Enhancements....................................................................................................................... 2 Special Considerations................................................................................................................................ 3 Known Restrictions...................................................................................................................................... 4 Restrictions for Connect:Direct for Microsoft Windows........................................................................ 4 Restrictions for Related Software.......................................................................................................... 6 Installation Notes.........................................................................................................................................6
    [Show full text]
  • A General-Purpose High Accuracy and Space Efficient
    countBF: A General-purpose High Accuracy and Space Efficient Counting Bloom Filter 1st Sabuzima Nayak 2nd Ripon Patgiri, Senior Member, IEEE Dept. Computer Science & Engineering Dept. Computer Science & Engineering National Institute of Technology Silchar National Institute of Technology Silchar Cachar-788010, Assam, India Cachar-788010, Assam, India sabuzima [email protected] [email protected] Abstract—Bloom Filter is a probabilistic data structure for Delete operation is crucial for Bloom Filter. Suppose a the membership query, and it has been intensely experimented database management system integrates Bloom Filter to avoid in various fields to reduce memory consumption and enhance unnecessary disk accesses and enhance its performance sig- a system’s performance. Bloom Filter is classified into two key categories: counting Bloom Filter (CBF), and non-counting nificantly with a tiny amount of memory [18]. Thus, the Bloom Filter. CBF has a higher false positive probability than database management system requires to insert, query, and standard Bloom Filter (SBF), i.e., CBF uses a higher memory delete operation for which conventional Bloom Filter is not footprint than SBF. But CBF can address the issue of the false suitable. Therefore, counting Bloom Filter is used in such negative probability. Notably, SBF is also false negative free, but kind of requirements. Another example is a network router. it cannot support delete operations like CBF. To address these issues, we present a novel counting Bloom Filter based on SBF Older packets are removed from the router databases. There- and 2D Bloom Filter, called countBF. countBF uses a modified fore, the router requires counting Bloom Filter.
    [Show full text]
  • A Survey of Distributed File Systems
    A Survey of Distributed File Systems M. Satyanarayanan Department of Computer Science Carnegie Mellon University February 1989 Abstract Abstract This paper is a survey of the current state of the art in the design and implementation of distributed file systems. It consists of four major parts: an overview of background material, case studies of a number of contemporary file systems, identification of key design techniques, and an examination of current research issues. The systems surveyed are Sun NFS, Apollo Domain, Andrew, IBM AIX DS, AT&T RFS, and Sprite. The coverage of background material includes a taxonomy of file system issues, a brief history of distributed file systems, and a summary of empirical research on file properties. A comprehensive bibliography forms an important of the paper. Copyright (C) 1988,1989 M. Satyanarayanan The author was supported in the writing of this paper by the National Science Foundation (Contract No. CCR-8657907), Defense Advanced Research Projects Agency (Order No. 4976, Contract F33615-84-K-1520) and the IBM Corporation (Faculty Development Award). The views and conclusions in this document are those of the author and do not represent the official policies of the funding agencies or Carnegie Mellon University. 1 1. Introduction The sharing of data in distributed systems is already common and will become pervasive as these systems grow in scale and importance. Each user in a distributed system is potentially a creator as well as a consumer of data. A user may wish to make his actions contingent upon information from a remote site, or may wish to update remote information.
    [Show full text]
  • File Server Scaling with Network-Attached Secure Disks
    Appears in Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (Sigmetrics ‘97), Seattle, Washington, June 15-18, 1997. File Server Scaling with Network-Attached Secure Disks Garth A. Gibson†, David F. Nagle*, Khalil Amiri*, Fay W. Chang†, Eugene M. Feinberg*, Howard Gobioff†, Chen Lee†, Berend Ozceri*, Erik Riedel*, David Rochberg†, Jim Zelenka† *Department of Electrical and Computer Engineering †School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213-3890 [email protected] http://www.cs.cmu.edu/Web/Groups/NASD/ Abstract clients with efficient, scalable, high-bandwidth access to stored data. This paper discusses a powerful approach to fulfilling this By providing direct data transfer between storage and client, net- need. Network-attached storage provides high bandwidth by work-attached storage devices have the potential to improve scal- directly attaching storage to the network, avoiding file server ability for existing distributed file systems (by removing the server store-and-forward operations and allowing data transfers to be as a bottleneck) and bandwidth for new parallel and distributed file striped over storage and switched-network links. systems (through network striping and more efficient data paths). The principal contribution of this paper is to demonstrate the Together, these advantages influence a large enough fraction of the potential of network-attached storage devices for penetrating the storage market to make commodity network-attached storage fea- markets defined by existing distributed file system clients, specifi- sible. Realizing the technology’s full potential requires careful cally the Network File System (NFS) and Andrew File System consideration across a wide range of file system, networking and (AFS) distributed file system protocols.
    [Show full text]
  • Filesystems HOWTO Filesystems HOWTO Table of Contents Filesystems HOWTO
    Filesystems HOWTO Filesystems HOWTO Table of Contents Filesystems HOWTO..........................................................................................................................................1 Martin Hinner < [email protected]>, http://martin.hinner.info............................................................1 1. Introduction..........................................................................................................................................1 2. Volumes...............................................................................................................................................1 3. DOS FAT 12/16/32, VFAT.................................................................................................................2 4. High Performance FileSystem (HPFS)................................................................................................2 5. New Technology FileSystem (NTFS).................................................................................................2 6. Extended filesystems (Ext, Ext2, Ext3)...............................................................................................2 7. Macintosh Hierarchical Filesystem − HFS..........................................................................................3 8. ISO 9660 − CD−ROM filesystem.......................................................................................................3 9. Other filesystems.................................................................................................................................3
    [Show full text]
  • SGI™ Propack 1.3 for Linux™ Start Here
    SGI™ ProPack 1.3 for Linux™ Start Here Document Number 007-4062-005 © 1999—2000 Silicon Graphics, Inc.— All Rights Reserved The contents of this document may not be copied or duplicated in any form, in whole or in part, without the prior written permission of Silicon Graphics, Inc. LIMITED AND RESTRICTED RIGHTS LEGEND Use, duplication, or disclosure by the Government is subject to restrictions as set forth in the Rights in Data clause at FAR 52.227-14 and/or in similar or successor clauses in the FAR, or in the DOD, DOE or NASA FAR Supplements. Unpublished rights reserved under the Copyright Laws of the United States. Contractor/ manufacturer is SGI, 1600 Amphitheatre Pkwy., Mountain View, CA 94043-1351. Silicon Graphics is a registered trademark and SGI and SGI ProPack for Linux are trademarks of Silicon Graphics, Inc. Intel is a trademark of Intel Corporation. Linux is a trademark of Linus Torvalds. NCR is a trademark of NCR Corporation. NFS is a trademark of Sun Microsystems, Inc. Oracle is a trademark of Oracle Corporation. Red Hat is a registered trademark and RPM is a trademark of Red Hat, Inc. SuSE is a trademark of SuSE Inc. TurboLinux is a trademark of TurboLinux, Inc. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd. SGI™ ProPack 1.3 for Linux™ Start Here Document Number 007-4062-005 Contents List of Tables v About This Guide vii Reader Comments vii 1. Release Features 1 Feature Overview 2 Qualified Drivers 3 Patches and Changes to Base Linux Distributions 3 2.
    [Show full text]
  • State of the Art: Where We Are with the Ext3 Filesystem
    State of the Art: Where we are with the Ext3 filesystem Mingming Cao, Theodore Y. Ts’o, Badari Pulavarty, Suparna Bhattacharya IBM Linux Technology Center {cmm, theotso, pbadari}@us.ibm.com, [email protected] Andreas Dilger, Alex Tomas, Cluster Filesystem Inc. [email protected], [email protected] Abstract 1 Introduction Although the ext2 filesystem[4] was not the first filesystem used by Linux and while other filesystems have attempted to lay claim to be- ing the native Linux filesystem (for example, The ext2 and ext3 filesystems on Linux R are when Frank Xia attempted to rename xiafs to used by a very large number of users. This linuxfs), nevertheless most would consider the is due to its reputation of dependability, ro- ext2/3 filesystem as most deserving of this dis- bustness, backwards and forwards compatibil- tinction. Why is this? Why have so many sys- ity, rather than that of being the state of the tem administrations and users put their trust in art in filesystem technology. Over the last few the ext2/3 filesystem? years, however, there has been a significant amount of development effort towards making There are many possible explanations, includ- ext3 an outstanding filesystem, while retaining ing the fact that the filesystem has a large and these crucial advantages. In this paper, we dis- diverse developer community. However, in cuss those features that have been accepted in our opinion, robustness (even in the face of the mainline Linux 2.6 kernel, including direc- hardware-induced corruption) and backwards tory indexing, block reservation, and online re- compatibility are among the most important sizing.
    [Show full text]
  • Linux 2.5 Kernel Developers Summit
    conference reports This issue’s reports are on the Linux 2.5 Linux 2.5 Kernel Developers Linux development, but I certainly Kernel Developers Summit Summit thought that, in all of this time, someone would have brought this group together OUR THANKS TO THE SUMMARIZER: SAN JOSE, CALIFORNIA before. Rik Farrow, with thanks to La Monte MARCH 30-31, 2001 Yarroll and Chris Mason for sharing their Summarized by Rik Farrow Another difference appeared when the notes. first session started on Friday morning. The purpose of this workshop was to The conference room was set up with cir- provide a forum for discussion of cular tables, each with power strips for changes to be made in the 2.5 release of For additional information on the Linux laptops, and only a few attendees were Linux (a trademark of Linus Torvalds). I not using a laptop. USENIX had pro- 2.5 Kernel Developers Summit, see the assume that many people reading this vided Aeronet wireless setup via the following sites: will be familiar with Linux, and I will hotel’s T1 link, and people were busy <http://lwn.net/2001/features/KernelSummit/> attempt to explain things that might be typing and compiling. Chris Mason of unfamiliar to others. That said, the odd- <http://cgi.zdnet.com/slink?91362:12284618> OSDN noticed that Dave Miller had numbered releases, like 2.3 and now 2.5, <http://www.osdn.com/conferences/kernel/> written a utility to modulate the speed of are development releases where the the CPU fans based upon the tempera- intent is to try out new features or make ture reading from his motherboard.
    [Show full text]