File Systems and Storage


;login: JUNE 2014 VOL. 39, NO. 3

File Systems and Storage
• Ceph as Used in the ~okeanos Cloud, Filippos Giannakos, Vangelis Koukis, Constantinos Venetsanopoulos, and Nectarios Koziris
• OpenStack Explained, Mark Lamourine
• HDFS Under HBase: Layering and Performance, Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau

Columns
• Practical Perl Tools: Using 0MQ, Part 1, David N. Blank-Edelman
• Python: Asynchronous I/O, David Beazley
• iVoyeur: Monitoring Design Patterns, Dave Josephsen
• For Good Measure: Exploring with a Purpose, Dan Geer and Jay Jacobs
• /dev/random: Clouds and the Future of Advertising, Robert G. Ferrell
• The Saddest Moment, James Mickens

Conference Reports
• FAST '14: 12th USENIX Conference on File and Storage Technologies
• 2014 USENIX Research in Linux File and Storage Technologies Summit

UPCOMING EVENTS

2014 USENIX Federated Conferences Week, June 17–20, 2014, Philadelphia, PA, USA
• USENIX ATC '14: 2014 USENIX Annual Technical Conference, June 19–20, 2014, www.usenix.org/atc14
• ICAC '14: 11th International Conference on Autonomic Computing, June 18–20, 2014, www.usenix.org/icac14
• HotCloud '14: 6th USENIX Workshop on Hot Topics in Cloud Computing, June 17–18, 2014, www.usenix.org/hotcloud14
• HotStorage '14: 6th USENIX Workshop on Hot Topics in Storage and File Systems, June 17–18, 2014, www.usenix.org/hotstorage14
• 9th International Workshop on Feedback Computing, June 17, 2014, www.usenix.org/feedbackcomputing14
• WiAC '14: 2014 USENIX Women in Advanced Computing Summit, June 18, 2014, www.usenix.org/wiac14
• UCMS '14: 2014 USENIX Configuration Management Summit, June 19, 2014, www.usenix.org/ucms14
• URES '14: 2014 USENIX Release Engineering Summit, June 20, 2014, www.usenix.org/ures14

23rd USENIX Security Symposium, August 20–22, 2014, San Diego, CA, USA, www.usenix.org/sec14
Workshops co-located with USENIX Security '14:
• EVT/WOTE '14: 2014 Electronic Voting Technology Workshop/Workshop on Trustworthy Elections, August 18–19, 2014, www.usenix.org/evtwote14
• USENIX Journal of Election Technology and Systems (JETS), published in conjunction with EVT/WOTE, www.usenix.org/jets
• CSET '14: 7th Workshop on Cyber Security Experimentation and Test, August 18, 2014, www.usenix.org/cset14
• 3GSE '14: 2014 USENIX Summit on Gaming, Games, and Gamification in Security Education, August 18, 2014, www.usenix.org/3gse14
• FOCI '14: 4th USENIX Workshop on Free and Open Communications on the Internet, August 18, 2014, www.usenix.org/foci14
• HotSec '14: 2014 USENIX Summit on Hot Topics in Security, August 19, 2014, www.usenix.org/hotsec14
• HealthTech '14: 2014 USENIX Summit on Health Information Technologies (Safety, Security, Privacy, and Interoperability of Health Information Technologies), August 19, 2014, www.usenix.org/healthtech14
• WOOT '14: 8th USENIX Workshop on Offensive Technologies, August 19, 2014, www.usenix.org/woot14

OSDI '14: 11th USENIX Symposium on Operating Systems Design and Implementation, October 6–8, 2014, Broomfield, CO, USA, www.usenix.org/osdi14
Co-located with OSDI '14 and taking place October 5, 2014:
• Diversity '14: 2014 Workshop on Supporting Diversity in Systems Research, www.usenix.org/diversity14
• HotDep '14: 10th Workshop on Hot Topics in Dependable Systems, www.usenix.org/hotdep14, submissions due July 10, 2014
• HotPower '14: 6th Workshop on Power-Aware Computing and Systems, www.usenix.org/hotpower14
• INFLOW '14: 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads, www.usenix.org/inflow14, submissions due July 1, 2014
• TRIOS '14: 2014 Conference on Timely Results in Operating Systems, www.usenix.org/trios14

LISA '14, November 9–14, 2014, Seattle, WA, USA, www.usenix.org/lisa14
Co-located with LISA '14:
• SESA '14: 2014 USENIX Summit for Educators in System Administration, November 11, 2014

Stay Connected: www.usenix.org/facebook, twitter.com/usenix, www.usenix.org/linkedin, www.usenix.org/youtube, www.usenix.org/blog, www.usenix.org/gplus
EDITOR: Rik Farrow, [email protected]
COPY EDITORS: Steve Gilmartin, Amber Ankerholz
PRODUCTION MANAGER: Michele Nelson
PRODUCTION: Arnold Gatilao, Casey Henderson
TYPESETTER: Star Type, [email protected]

USENIX ASSOCIATION, 2560 Ninth Street, Suite 215, Berkeley, California 94710. Phone: (510) 528-8649. FAX: (510) 548-5738. www.usenix.org

;login: is the official magazine of the USENIX Association. ;login: (ISSN 1044-6397) is published bi-monthly by the USENIX Association, 2560 Ninth Street, Suite 215, Berkeley, CA 94710. $90 of each member's annual dues is for a subscription to ;login:. Subscriptions for non-members are $90 per year. Periodicals postage paid at Berkeley, CA, and additional offices. POSTMASTER: Send address changes to ;login:, USENIX Association, 2560 Ninth Street, Suite 215, Berkeley, CA 94710.

©2014 USENIX Association. USENIX is a registered trademark of the USENIX Association. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. USENIX acknowledges all trademarks herein. Where those designations appear in this publication and USENIX is aware of a trademark claim, the designations have been printed in caps or initial caps.

CONTENTS

OPINION
• Musings, Rik Farrow

FILE SYSTEMS AND STORAGE
• ~okeanos: Large-Scale Cloud Service Using Ceph, Filippos Giannakos, Vangelis Koukis, Constantinos Venetsanopoulos, and Nectarios Koziris
• Analysis of HDFS under HBase: A Facebook Messages Case Study, Tyler Harter, Dhruba Borthakur, Siying Dong, Amitanand Aiyer, Liyin Tang, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau

CLOUD
• OpenStack, Mark Lamourine

TROUBLESHOOTING
• When the Stars Align, and When They Don't, Tim Hockin

SYSADMIN
• Solving Rsyslog Performance Issues, David Lang
• /var/log/manager: Rock Stars and Shift Schedules, Andy Seely

COLUMNS
• Practical Perl Tools: My Hero, Zero (Part 1), David N. Blank-Edelman
• Python Gets an Event Loop (Again), David Beazley
• iVoyeur: Monitoring Design Patterns, Dave Josephsen
• For Good Measure: Exploring with a Purpose, Dan Geer and Jay Jacobs
• /dev/random, Robert G. Ferrell
• The Saddest Moment, James Mickens

BOOKS
• Book Reviews, Elizabeth Zwicky

USENIX NOTES
• Notice of Annual Meeting
• Results of the Election for the USENIX Board of Directors, 2014–2016

CONFERENCE REPORTS
• FAST '14: 12th USENIX Conference on File and Storage Technologies
• Linux FAST Summit '14


MUSINGS
Rik Farrow

Rik is the editor of ;login:. [email protected]

After writing this column for more than 15 years, I'm a bit stuck with what I should talk about. But I do have a couple of things on my mind: Krste Asanović's FAST '14 keynote [1] and something Brendan Gregg wrote in his Systems Performance book [2].

Several years ago, I compared computer systems architecture to an assembly line in a factory [3], where not having a part ready (some data) held up the entire assembly line. Brendan expressed this differently, in Table 2.2 of his book, where he compares the speed of a 3.3 GHz processor to other system components by using human timescales.

If a single CPU cycle is represented by one second, instead of 0.3 nanoseconds, then fetching data from Level 1 cache takes three seconds, fetching from Level 3 cache takes 43 seconds, and having to go to DRAM takes six minutes. Having to wait for a disk read takes months, and a fetch from a remote datacenter can take years. It's amazing that anything gets done—but then you remember the scaling of 3.3 billion to one. Events occurring less than a few tens of milliseconds apart appear simultaneous to us humans.
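The rescaling itself is simple arithmetic; a minimal sketch (using round, typical latency figures rather than Gregg's exact table) looks like this:

```python
# Rescale hardware latencies to human timescales by treating one CPU cycle
# (about 0.3 ns at 3.3 GHz) as one second. The latency values are rough,
# typical figures, not the exact numbers from Gregg's Table 2.2, but the
# output roughly matches the prose above: seconds, minutes, months, years.

CYCLE_NS = 1.0 / 3.3          # one cycle of a 3.3 GHz clock, in nanoseconds
SCALE = 1.0 / CYCLE_NS        # multiply by this to map "ns" onto "seconds"

latencies_ns = {
    "L1 cache hit": 1.0,
    "L3 cache hit": 13.0,
    "DRAM access": 100.0,
    "Disk read": 4_000_000.0,
    "Remote datacenter round trip": 100_000_000.0,
}

def humanize(seconds: float) -> str:
    """Format a duration in seconds using the largest convenient unit."""
    for unit, size in (("years", 31_536_000), ("days", 86_400),
                       ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

for name, ns in latencies_ns.items():
    scaled = ns * SCALE  # nanoseconds -> "human" seconds
    print(f"{name:30s} {ns:>12,.0f} ns  ->  {humanize(scaled)}")
```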
Krste Asanović spoke about the ASPIRE Lab, where they are examining the shift from the performance increases we've seen in silicon to a post-scaling world in which we need to consider the entire hardware and software stack. You can read the summary of his talk in this issue of ;login: or go online and watch the video of his presentation.

A lot of what Asanović talked about seemed familiar to me, because I had heard some of these ideas from the UCB Par Lab, in a paper in 2009 [4]. Some things were new, such as using photonic switching and message passing (read David Blank-Edelman's column) instead of the typical CPU bus interconnects. Asanović also suggested that data be encrypted until it reached the core where it would be processed, an idea I had after hearing that Par Lab paper, and one that I'm glad someone else will actually do something about. Still other things remained the same, such as having many homogeneous cores for the bulk of data processing, with a handful of specialized cores for things like vector processing. Asanović, in the question-and-answer session that followed his speech, said that only 0.01% of computing requires specialized software and hardware.

Having properly anticipated the need for encryption of data while at rest and even while being transferred by the photonic message-passing system, I thought I'd take a shot at imagining the rest of Warehouse Scale Computing (WSC), but without borrowing from the ASPIRE FireBox design too much.

The Need for Speed

First off, you need to keep those swift cores happy, which means that data must always be ready nearby. That's a tall order, and one that hardware designers have been aware of for many years. Photonic switching at one terabit per second certainly sounds nice, and it's hard to imagine something that beats a design that already seems like science fiction based on the name. For my design, I will simply specify a message-passing network that connects all cores and their local caches to the wider world beyond. Like the FireBox design, there will be no Level 1 cache coherency or shared L1 caches. If a module running on one core wants to share data with another, it will have to be done via message passing.

My design has the outward appearance of a cube.
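That "share by sending a message, not by touching shared memory" rule is easy to picture in ordinary application code. The toy Python sketch below is my illustration of the idea, not FireBox or ASPIRE code: two workers that never share a data structure and exchange data only over queues.

```python
# Toy illustration of sharing data via message passing rather than shared
# memory: each worker owns its data and communicates over queues.
# This is an analogy for the no-shared-L1 design described above, not an
# implementation of FireBox or any ASPIRE hardware.
from multiprocessing import Process, Queue

def producer(out_q: Queue) -> None:
    # The producer "owns" this data; nothing else reads its memory directly.
    for block in (b"alpha", b"beta", b"gamma"):
        out_q.put(block)          # explicit message instead of a shared cache line
    out_q.put(None)               # sentinel: no more messages

def consumer(in_q: Queue, result_q: Queue) -> None:
    total = 0
    while (msg := in_q.get()) is not None:
        total += len(msg)         # works on a private copy of the message
    result_q.put(total)

if __name__ == "__main__":
    data_q, result_q = Queue(), Queue()
    procs = [Process(target=producer, args=(data_q,)),
             Process(target=consumer, args=(data_q, result_q))]
    for p in procs:
        p.start()
    print("bytes processed:", result_q.get())
    for p in procs:
        p.join()
```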
Recommended publications
  • NUMA-Aware Thread Migration for High Performance NVMM File Systems
NUMA-Aware Thread Migration for High Performance NVMM File Systems
Ying Wang, Dejun Jiang and Jin Xiong
SKL Computer Architecture, ICT, CAS; University of Chinese Academy of Sciences. fwangying01, jiangdejun, [email protected]

Abstract: Emerging Non-Volatile Main Memories (NVMMs) provide persistent storage and can be directly attached to the memory bus, which allows building file systems on non-volatile main memory (NVMM file systems). Since file systems are built on memory, NUMA architecture has a large impact on their performance due to the presence of remote memory access and imbalanced resource usage. Existing works migrate thread and thread data on DRAM to solve these problems. Unlike DRAM, NVMM introduces extra latency and lifetime limitations. This results in expensive data migration for NVMM file systems on NUMA architecture. In this paper, we argue that NUMA-aware thread migration without migrating data is desirable for NVMM file systems. We propose NThread, a NUMA-aware thread migration module for NVMM file system.

…out considering the NVMM usage on NUMA nodes. Besides, application threads accessing file system rely on the default operating system thread scheduler, which migrates thread only considering CPU utilization. These bring remote memory access and resource contentions to application threads when reading and writing files, and thus reduce the performance of NVMM file systems. We observe that when performing file reads/writes from 4 KB to 256 KB on a NVMM file system (NOVA [47] on NVMM), the average latency of accessing remote node increases by 65.5% compared to accessing local node. The average bandwidth is reduced by…
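The excerpt's core idea, moving the thread to the node that holds the data instead of moving the data, can be sketched with Linux CPU affinity. The snippet below is a rough illustration, not NThread itself; the node-to-CPU mapping is a made-up example for a two-node machine.

```python
# Sketch of "migrate the thread, not the data" (Linux-only; not NThread).
# NODE_CPUS is a hypothetical mapping for a two-socket machine; on a real
# system it would come from /sys/devices/system/node/node*/cpulist or libnuma.
import os

NODE_CPUS = {
    0: {0, 1, 2, 3},      # CPUs attached to NUMA node 0
    1: {4, 5, 6, 7},      # CPUs attached to NUMA node 1
}

def run_near_data(node_of_data: int) -> None:
    """Pin the calling process to the CPUs of the node holding the data."""
    os.sched_setaffinity(0, NODE_CPUS[node_of_data])  # 0 = current process

def process_file_region(node_of_data: int, path: str) -> int:
    run_near_data(node_of_data)   # cheap: no NVMM pages are copied or migrated
    with open(path, "rb") as f:   # subsequent accesses now hit local memory
        return len(f.read())

if __name__ == "__main__":
    with open("/tmp/example.dat", "wb") as f:
        f.write(b"x" * 4096)
    # Suppose the file system reports that this file's pages live on node 1.
    print(process_file_region(1, "/tmp/example.dat"))
```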
  • GlobalFS: A Strongly Consistent Multi-Site File System
GlobalFS: A Strongly Consistent Multi-Site File System
Leandro Pacheco (University of Lugano), Raluca Halalai (University of Neuchâtel), Valerio Schiavoni (University of Neuchâtel), Fernando Pedone (University of Lugano), Etienne Rivière (University of Neuchâtel), Pascal Felber (University of Neuchâtel)

Abstract: This paper introduces GlobalFS, a POSIX-compliant geographically distributed file system. GlobalFS builds on two fundamental building blocks, an atomic multicast group communication abstraction and multiple instances of a single-site data store. We define four execution modes and show how all file system operations can be implemented with these modes while ensuring strong consistency and tolerating failures. We describe the GlobalFS prototype in detail and report on an extensive performance assessment. We have deployed GlobalFS across all EC2 regions and show that the system scales geographically, providing performance comparable to other state-of-the-art distributed file systems for local commands and allowing for strongly consistent operations over the whole system.

…consistency, availability, and tolerance to partitions. Our goal is to ensure strongly consistent file system operations despite node failures, at the price of possibly reduced availability in the event of a network partition. Weak consistency is suitable for domain-specific applications where programmers can anticipate and provide resolution methods for conflicts, or work with last-writer-wins resolution methods. Our rationale is that for general-purpose services such as a file system, strong consistency is more appropriate as it is both more intuitive for the users and does not require human intervention in case of conflicts. Strong consistency requires ordering commands across replicas, which needs coordination among nodes at geographically distributed sites (i.e., regions). Designing strongly consistent distributed systems that provide good…
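The atomic multicast that GlobalFS relies on is not shown here; the sketch below only illustrates the property it provides, namely that replicas applying the same totally ordered command log deterministically end up in the same state.

```python
# Toy state-machine replication: if every replica applies the same totally
# ordered command log deterministically, the replicas stay identical.
# This illustrates the ordering argument only; it is not GlobalFS's atomic
# multicast or its execution modes.

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.files: dict[str, bytes] = {}

    def apply(self, cmd: tuple) -> None:
        op, path, *rest = cmd
        if op == "create":
            self.files[path] = b""
        elif op == "write":
            self.files[path] = rest[0]
        elif op == "delete":
            self.files.pop(path, None)

# A single totally ordered log (the role atomic multicast would play).
ordered_log = [
    ("create", "/a"),
    ("write", "/a", b"hello"),
    ("create", "/b"),
    ("delete", "/b"),
]

replicas = [Replica("eu-west"), Replica("us-east"), Replica("ap-south")]
for cmd in ordered_log:          # same commands, same order, at every site
    for r in replicas:
        r.apply(cmd)

assert all(r.files == replicas[0].files for r in replicas)
print("all replicas agree:", replicas[0].files)
```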
  • Unravel Data Systems Version 4.5
UNRAVEL DATA SYSTEMS VERSION 4.5
Component (version): license
• jQuery 1.8.2: MIT License
• Apache Tomcat 5.5.23: Apache License 2.0
• Tachyon Project POM 0.8.2: Apache License 2.0
• Apache Directory LDAP API Model 1.0.0-M20: Apache License 2.0
• apache/incubator-heron 0.16.5.1: Apache License 2.0
• Maven Plugin API 3.0.4: Apache License 2.0
• ApacheDS Authentication Interceptor 2.0.0-M15: Apache License 2.0
• Apache Directory LDAP API Extras ACI 1.0.0-M20: Apache License 2.0
• Apache HttpComponents Core 4.3.3: Apache License 2.0
• Spark Project Tags 2.0.0-preview: Apache License 2.0
• Curator Testing 3.3.0: Apache License 2.0
• Apache HttpComponents Core 4.4.5: Apache License 2.0
• Apache Commons Daemon 1.0.15: Apache License 2.0
• classworlds 2.4: Apache License 2.0
• abego TreeLayout Core 1.0.1: BSD 3-clause "New" or "Revised" License
• jackson-core 2.8.6: Apache License 2.0
• Lucene Join 6.6.1: Apache License 2.0
• Apache Commons CLI 1.3-cloudera-pre-r1439998: Apache License 2.0
• hive-apache 0.5: Apache License 2.0
• scala-parser-combinators 1.0.4: BSD 3-clause "New" or "Revised" License
• com.springsource.javax.xml.bind 2.1.7: Common Development and Distribution License 1.0
• SnakeYAML 1.15: Apache License 2.0
• JUnit 4.12: Common Public License 1.0
• ApacheDS Protocol Kerberos 2.0.0-M12: Apache License 2.0
• Apache Groovy 2.4.6: Apache License 2.0
• JGraphT - Core 1.2.0: (GNU Lesser General Public License v2.1 or later AND Eclipse Public License 1.0)
• chill-java 0.5.0: Apache License 2.0
• Apache Commons Logging 1.2: Apache License 2.0
• OpenCensus 0.12.3: Apache License 2.0
• ApacheDS Protocol…
  • Using the Andrew File System on *BSD ([email protected], BSDCan, 2006)
Using the Andrew File System on *BSD
[email protected], BSDCan, 2006

Outline: why another network filesystem; a one-slide history of the Andrew File System; user view; admin view; OpenAFS; Arla; AFS on OpenBSD, FreeBSD, and NetBSD.

Filesharing on the Internet today: use FTP or link to HTTP, get a file interface through WebDAV, or use an insecure protocol over a VPN.

History of AFS: 1984, developed at Carnegie Mellon; 1989, TransArc Corporation; 1994, over to IBM; 1997, Arla, aimed at Linux and BSD; 2000, IBM releases source; 2000, foundation of OpenAFS.

User view (1): a global filesystem rooted at /afs, e.g. /afs/cern.ch/..., /afs/cmu.edu/..., /afs/gorlaeus.net/users/h/hugo/...

User view (2): authentication through Kerberos:
  #> kinit <username>    (obtain krbtgt/<realm>@<realm>)
  #> afslog              (obtain afs@<realm>)
  #> cd /afs/<cell>/users/<username>

User view (3): directory-based ACLs and quota usage; runs on Windows, OS X, Linux, Solaris ... and *BSD.

Admin view (1): cells contain servers, servers hold partitions, partitions hold volumes (<cell> / <server> / <partition> / <volume>).

Admin view (2, 2a): a path such as /afs/gorlaeus.net/users/h/hugo/presos/afs_slides.graffle resolves within the cell gorlaeus.net to volumes (users, h, hugo, bram) placed on server partitions such as fwncafs1:/vicepa, fwncafs1:/vicepb, and fwncafs2:/vicepa.

Admin view (3): servers require a KeyFile (roughly equivalent to a keytab); the procedure differs: Heimdal uses ktutil copy, MIT uses asetkey add.

Admin view (4): an entry in CellServDB:
  >gorlaeus.net   #my cell name
  10.0.0.1        <dbserver host name>
Required on servers; required on clients without DynRoot.

Admin view (5): file locking; no databases on AFS (requires byte range locking).
  • Filesystems HOWTO
Filesystems HOWTO
Martin Hinner <[email protected]>, http://martin.hinner.info

Table of Contents
1. Introduction
2. Volumes
3. DOS FAT 12/16/32, VFAT
4. High Performance FileSystem (HPFS)
5. New Technology FileSystem (NTFS)
6. Extended filesystems (Ext, Ext2, Ext3)
7. Macintosh Hierarchical Filesystem - HFS
8. ISO 9660 - CD-ROM filesystem
9. Other filesystems
  • Artificial Intelligence for Understanding Large and Complex Datacenters
Artificial Intelligence for Understanding Large and Complex Datacenters
by Pengfei Zheng, Department of Computer Science, Duke University

Advisor: Benjamin C. Lee. Committee: Bruce M. Maggs, Jeffrey S. Chase, Jun Yang.
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University, 2020.
Copyright © 2020 by Pengfei Zheng. All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence.

Abstract: With the democratization of global-scale web applications and cloud computing, understanding the performance of a live production datacenter becomes a prerequisite for making strategic decisions related to datacenter design and optimization. Advances in monitoring, tracing, and profiling large, complex systems provide rich datasets and establish a rigorous foundation for performance understanding and reasoning. But the sheer volume and complexity of collected data challenges existing techniques, which rely heavily on human intervention, expert knowledge, and simple statistics. In this dissertation, we address this challenge using artificial intelligence and make the case for two important problems, datacenter performance diagnosis and datacenter workload characterization. The first thrust of this dissertation is the use of statistical causal inference and Bayesian probabilistic model for datacenter straggler diagnosis.
  • Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook
Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook
Zhichao Cao (University of Minnesota, Twin Cities, and Facebook), Siying Dong and Sagar Vemuri (Facebook), David H.C. Du (University of Minnesota, Twin Cities)
https://www.usenix.org/conference/fast20/presentation/cao-zhichao
This paper is included in the Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST '20), February 25–27, 2020, Santa Clara, CA, USA. ISBN 978-1-939133-12-0.

Abstract: Persistent key-value stores are widely used as building blocks in today's IT infrastructure for managing and storing large amounts of data. However, studies of characterizing real-world workloads for key-value stores are limited due to the lack of tracing/analyzing tools and the difficulty of collecting traces in operational environments. In this paper, we first present a detailed characterization of workloads from three typical RocksDB production use cases at Facebook: UDB (a MySQL storage layer for social graph data), ZippyDB (a dis…

…stores is still challenging. First, there are very limited studies of real-world workload characterization and analysis for KV-stores, and the performance of KV-stores is highly related to the workloads generated by applications. Second, the analytic methods for characterizing KV-store workloads are different from the existing workload characterization studies for block storage or file systems. KV-stores have simple but very different interfaces and behaviors. A set of good workload collection, analysis, and characterization tools can benefit both developers and users of KV-stores by optimizing performance and developing new functions.
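As a minimal sketch of what characterizing a KV workload involves (the trace format here is an assumption for illustration, not RocksDB's or Facebook's tracing format), a few lines of Python can already report the operation mix, hot keys, and value-size statistics:

```python
# Minimal sketch of KV-workload characterization: operation mix, hot keys,
# and value-size distribution over an assumed trace of
# (operation, key, value_size_in_bytes) records.
from collections import Counter
from statistics import mean, median

trace = [
    ("GET", "user:17", 0), ("PUT", "user:17", 412), ("GET", "user:17", 0),
    ("GET", "post:9", 0), ("PUT", "post:9", 2048), ("DELETE", "tmp:1", 0),
    ("GET", "user:17", 0), ("PUT", "user:42", 128),
]

ops = Counter(op for op, _, _ in trace)                 # GET/PUT/DELETE mix
key_freq = Counter(key for _, key, _ in trace)          # access skew
value_sizes = [size for op, _, size in trace if op == "PUT"]

print("operation mix :", dict(ops))
print("hottest keys  :", key_freq.most_common(2))
print("value size    : mean=%.0f B, median=%.0f B"
      % (mean(value_sizes), median(value_sizes)))
```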
  • MyRocks in MariaDB
MyRocks in MariaDB
Sergei Petrunia <[email protected]>, MariaDB Shenzhen Meetup, November 2017

What is MyRocks
• #include <Yoshinori's talk>
• This talk is about MyRocks in MariaDB

MyRocks lives in Facebook's MySQL branch
• github.com/facebook/mysql-5.6 (will call this "FB/MySQL")
• MyRocks lives there in storage/rocksdb
• FB/MySQL is easy to use if you are Facebook; not so easy if you are not :-)

FB/mysql-5.6, user perspective
• No binaries, no packages; compile yourself from source (dependencies, etc.)
• No releases (is the latest git revision OK?)
• Has extra features, e.g. extra counters "confuse" monitoring tools

FB/mysql-5.6, dev perspective
• Targets a CentOS-type OS (compiler, cmake version, etc.); others may or may not [periodically] work; MariaDB/Percona file pull requests to fix
• Special command to compile: https://github.com/facebook/mysql-5.6/wiki/Build-Steps
• Special command to run tests; the test suite assumes a big machine, some tests even a release build

Putting MyRocks in MariaDB
• Goals: wider adoption, ease of use, ease of development, have MyRocks in MariaDB and use it with MariaDB features
• Means: port MyRocks into MariaDB, provide binaries and packages

Status of MyRocks in MariaDB
• MariaDB 10.2 is GA (as of May 2017)
• It includes an ALPHA version of the MyRocks plugin; working to improve maturity
• It's a loadable plugin (ha_rocksdb.so)
• Packages: bintar, deb, rpm, win64 zip + MSI; deb/rpm have the MyRocks .so and tools in a separate package

Packaging for MyRocks in MariaDB: MyRocks and the RocksDB library
• MyRocks is tied to RocksDB@revno; RocksDB is a github submodule, with no compatibility with other versions
• RocksDB is always compiled with MyRocks and linked in statically
• Distros have a RocksDB package; MyRocks is not using it
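For a sense of what "loadable plugin" means in practice, the sketch below shows roughly how a client could enable MyRocks and create a RocksDB-backed table; the connection parameters are placeholders, and it assumes the mysql-connector-python package plus a MariaDB 10.2+ server with the MyRocks plugin package installed.

```python
# Rough sketch of enabling MyRocks and creating a table that uses it.
# Host/user/password are placeholders; requires mysql-connector-python and a
# MariaDB server where the MyRocks plugin package is available.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root",
                               password="secret", database="test")
cur = conn.cursor()

# Load the plugin once per server (MariaDB: INSTALL SONAME); ignore the
# error if it has already been installed.
try:
    cur.execute("INSTALL SONAME 'ha_rocksdb'")
except mysql.connector.Error:
    pass

# Tables then opt in to the RocksDB storage engine explicitly.
cur.execute("""
    CREATE TABLE IF NOT EXISTS kv_demo (
        k VARBINARY(255) PRIMARY KEY,
        v BLOB
    ) ENGINE=ROCKSDB
""")
conn.commit()
cur.close()
conn.close()
```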
  • A Persistent Storage Model for Extreme Computing (Shuangyang Yang, Louisiana State University and Agricultural and Mechanical College)
A Persistent Storage Model for Extreme Computing
Shuangyang Yang, Louisiana State University and Agricultural and Mechanical College
LSU Digital Commons, LSU Doctoral Dissertations, Graduate School, 2014
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations (Part of the Computer Sciences Commons)

Recommended citation: Yang, Shuangyang, "A Persistent Storage Model for Extreme Computing" (2014). LSU Doctoral Dissertations. 2910. https://digitalcommons.lsu.edu/gradschool_dissertations/2910

A dissertation submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science, by Shuangyang Yang, B.S., Zhejiang University, 2006; M.S., University of Dayton, 2008. December 2014. Copyright © 2014 Shuangyang Yang. All rights reserved.

Dedicated to my wife Miao Yu and our daughter Emily.

Acknowledgments: This dissertation would not be possible without several contributions. It is a great pleasure to thank Dr. Hartmut Kaiser (Louisiana State University), Dr. Walter B. Ligon III (Clemson University) and Dr. Maciej Brodowicz (Indiana University) for their ongoing help and support. It is a pleasure also to thank Dr. Bijaya B. Karki, Dr. Konstantin Busch, Dr. Supratik Mukhopadhyay at Louisiana State University for being my committee members and Dr.…
  • Comparative Analysis of Distributed and Parallel File Systems' Internal Techniques
Comparative Analysis of Distributed and Parallel File Systems' Internal Techniques
Viacheslav Dubeyko

Content
1 TERMINOLOGY AND ABBREVIATIONS
2 INTRODUCTION
3 COMPARATIVE ANALYSIS METHODOLOGY
4 FILE SYSTEM FEATURES CLASSIFICATION
  4.1 Distributed File Systems
    4.1.1 HDFS
    4.1.2 GFS (Google File System)
    4.1.3 InterMezzo
    4.1.4 CodA
    4.1.5 Ceph
    4.1.6 DDFS
  • DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling
DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling
Tanvir Ahmed Khan and Ian Neal (University of Michigan), Gilles Pokam (Intel Corporation), Barzan Mozafari and Baris Kasikci (University of Michigan)
https://www.usenix.org/conference/osdi21/presentation/khan
This paper is included in the Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, July 14–16, 2021. ISBN 978-1-939133-22-9.

Abstract: Poor data locality hurts an application's performance. While compiler-based techniques have been proposed to improve data locality, they depend on heuristics, which can sometimes hurt performance. Therefore, developers typically find data locality issues via dynamic profiling and repair them manually. Alas, existing profiling techniques incur high overhead when used to identify data locality problems and cannot be deployed in production, where programs may exhibit previously-unseen performance problems.

…cally at run time. In fact, as we (§6.2) and others [2, 15, 20, 27] demonstrate, compiler-based techniques can sometimes even hurt performance when the assumptions made by those heuristics do not hold in practice. To overcome the limitations of static optimizations, the systems community has invested substantial effort in developing dynamic profiling tools [28, 38, 57, 97, 102]. Dynamic profilers are capable of gathering detailed and more accurate execution information, which a developer can use to identify…
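DMon itself is not reproduced here; the toy benchmark below only illustrates the kind of data-locality problem such tools target, by timing a sequential traversal against a strided one over the same buffer. In CPython the gap is narrowed by interpreter overhead, but the access-pattern difference is the same one a locality profiler would flag.

```python
# Toy illustration of a data-locality problem (not DMon): the same 4M-element
# buffer is summed twice, once in memory order and once with a large stride.
# In compiled code the gap comes almost entirely from cache misses.
from array import array
import time

ROWS = COLS = 2000
data = array("q", range(ROWS * COLS))        # one contiguous row-major buffer

def sum_sequential() -> int:
    total = 0
    for x in data:                           # walks the buffer in memory order
        total += x
    return total

def sum_strided() -> int:
    total = 0
    for c in range(COLS):                    # column at a time: stride = COLS
        for r in range(ROWS):
            total += data[r * COLS + c]
    return total

for fn in (sum_sequential, sum_strided):
    start = time.perf_counter()
    result = fn()
    print(f"{fn.__name__:15s} {time.perf_counter() - start:6.3f}s  sum={result}")
```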
  • Real-Time LSM-Trees for HTAP Workloads
Real-Time LSM-Trees for HTAP Workloads
Hemant Saxena (University of Waterloo, [email protected]), Lukasz Golab (University of Waterloo, [email protected]), Stratos Idreos (Harvard University, [email protected]), Ihab F. Ilyas (University of Waterloo, [email protected])

Abstract: Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different…

…We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine. LSM-Trees are widely used in key-value stores (e.g., Google's BigTable and LevelDB, Cassandra, Facebook's RocksDB), RDBMSs (e.g., Facebook's MyRocks, SQLite4), blockchains (e.g., Hyperledger uses LevelDB), and data stream and time-series databases (e.g., InfluxDB). While Cassandra and RocksDB can simulate columnar storage via column families, we are not aware of any lifecycle-aware LSM-Trees in which the storage layout can change throughout the lifetime of the data. We fill this gap in our work, by extending the capabilities of LSM-based systems to efficiently serve real-time analytics and HTAP workloads.
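As a toy sketch of the lifecycle idea (an illustration only, not the paper's design): a miniature "LSM" whose top level keeps recent records in row form and whose lower level stores flushed records in column form.

```python
# Toy lifecycle-aware LSM sketch (an illustration, not the paper's design):
# recent writes sit in a row-oriented memtable; when it fills up, records are
# flushed to an older, column-oriented level, mirroring how the data layout
# can change as records age and propagate down the tree.

class TinyLifecycleLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable_limit = memtable_limit
        self.memtable: list[dict] = []            # row-oriented: list of records
        self.cold_level: dict[str, list] = {}     # column-oriented: column -> values

    def put(self, record: dict) -> None:
        self.memtable.append(record)              # OLTP-style row insert
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # "Compaction": rewrite aging rows into columnar form for OLAP scans.
        for record in self.memtable:
            for col, value in record.items():
                self.cold_level.setdefault(col, []).append(value)
        self.memtable.clear()

    def scan_column(self, col: str) -> list:
        # Analytics read: one contiguous column from the cold level, plus any
        # recent rows still sitting in the memtable.
        recent = [r[col] for r in self.memtable if col in r]
        return self.cold_level.get(col, []) + recent

lsm = TinyLifecycleLSM()
for i in range(6):
    lsm.put({"id": i, "price": 10 * i})
print(lsm.scan_column("price"))      # [0, 10, 20, 30, 40, 50]
```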