Isolation File Systems

Total Page:16

File Type:pdf, Size:1020Kb

Isolation File Systems Physical Disentanglement in a Container-Based File System Lanyue Lu, Yupu Zhang, Thanh Do, Samer AI-Kiswany Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau University of Wisconsin Madison Motivation Isolation in various subsystems ➡ resources: virtual machines, Linux containers ➡ security: BSD jail, sandbox ➡ reliability: address spaces File systems lack isolation ➡ physical entanglement in modern file systems Three problems due to entanglement ➡ global failures, slow recovery, bundled performance Global Failures Definition ➡ a failure which impacts all users of the file system or even the operating system Read-Only ➡ mark the file system as read-only ➡ e.g., metadata corruption, I/O failure Crash ➡ crash the file system or the operating system ➡ e.g., unexpected states, pointer faults Ext3 200 Read-Only Crash 150 100 50 Number of Failure Instances 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Linux 3.X Versions Ext4 Read-Only Crash 400 350 300 250 200 150 100 50 Number of Failure Instances 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Linux 3.X Versions Slow Recovery File system checker ➡ repair a corrupted file system ➡ usually offline Current checkers are not scalable ➡ increasing disk capacities and file system sizes ➡ scan the whole file system ➡ checking time is unacceptably long Scalability of fsck on Ext3 Ext3 1007 1000 800 723 600 476 400 Fsck Time (s) 231 200 0 200GB 400GB 600GB 800GB File-system Capacity Bundled Performance Shared transaction ➡ all updates share a single transaction ➡ unrelated workloads affect each other Consistency guarantee ➡ the same journal mode for all files ➡ limited flexibility for different tradeoffs Bundle Performance on Ext3 180 SQLite Varmail 146.7 150 120 90 76.1 60 Throughput (MB/s) 30 20 1.9 0 Alone Together Server Virtualization VM1 VM2 VM3 VM4 VM5 VD1 VD2 VD3 Shared file system VD4 VD5 All VMs will crash VM1 VM2 VM3 VM4 VM5 metadata VD1 VD2 VD3 Read-Only corruption or Crash VD4 VD5 Virtual Machines VM1 VM2 VM3 100 80 60 fsck: 496s + bootup: 68s 40 20 Throughput (IOPS) 0 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 Time (Second) Our Solution A new abstraction for disentanglement A disentangled file system: IceFS Three major benefits of IceFS ➡ failure isolation ➡ localized recovery ➡ specialized journaling Motivation IceFS Evaluation Conclusion A New Abstraction: Cube What is a cube ➡ an independent container for a group of files ➡ physically isolated Interface ➡ create a cube: mkdir(cube_flag) ➡ delete a cube: rmdir() ➡ add files to a cube ➡ remove files to a cube ➡ set cube attributes Cube Example Cube 2 Cube 1 / a c2 c1 b d A Disentangled File System No shared physical resources ➡ no shared metadata: e.g., block groups ➡ no shared disk blocks or buffers No access dependency ➡ partition linked lists or trees ➡ avoid directory hierarchy dependency No bundled transactions ➡ use separate transactions ➡ enable customized journaling modes IceFS A prototype based on Ext3 in Linux 3.5 Disentanglement ➡ directory indirection ➡ transaction splitting Directory Indirection Cube 2 Cube 1 / a c2 c1 b d Pathname matching for cubes’ top directory paths Cubes’ dentries are pinned in memory Transaction Splitting App 1 App 2 App 3 App 1 App 2 App 3 on-disk on-disk journal journal in-memory tx preallocated tx committed tx (a) Ext3/Ext4 (b) IceFS Separate transactions from different cubes commit to the disk journal in parallel Benefits of Disentanglement Localized reactions to failures ➡ per-cube read-only and crash Localized recovery ➡ only check faulty cubes ➡ offline and online Specialized journaling ➡ parallel journaling ➡ diverse journal modes ➡ no journal, no fsync, writeback, ordered, data Motivation IceFS Evaluation Conclusion Fast Recovery Ext3 IceFS 1007 1000 800 723 600 476 400 Fsck Time (s) 231 200 91 122 35 64 0 200GB 400GB 600GB 800GB File-system Capacity Specialized Journaling OR: ordered SQLite Varmail 250 220.3 NJ: no journal 200 150 120.6 125.4 103.4 100 76.1 Throughput (MB/s) 50 1.9 9.8 5.6 0 OR OR OR OR NJ OR OR NJ Ext3 IceFS Server Virtualization VM1 VM2 VM3 VM4 VM5 metadata VD1 VD2 VD3 corruption Shared file system VD4 VD5 Server Virtualization VM1 VM2 VM3 100 IceFS-Offline fsck: 35s 80 + bootup: 67s 60 40 20 0 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 100 IceFS-Online fsck: 74s 80 + bootup: 39s Throughput (IOPS) 60 40 20 0 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 Time (Second) Conclusion File systems lack physical isolation Our contribution ➡ a new abstraction: cube ➡ a disentangled file system: IceFS ➡ demonstrate its benefits: ➡ isolated failures ➡ localized recovery ➡ specialized journaling ➡ improving both reliability and performance Questions ? IceFS Disk Layout cube inode number, cube pathname orphan inode list, cube attributes SB S0 S1 S2 Sn bg bg bg bg bg cube 0 cube 1 cube 2 Each cube has one sub-super block (Si) and its own block groups (bg) .
Recommended publications
  • A Brief History of UNIX File Systems
    A Brief History of UNIX File Systems Val Henson IBM, Inc. [email protected] Summary • Review of UNIX file system concepts • File system formats, 1974-2004 • File system comparisons and recommendations • Fun trivia • Questions and answers (corrections ONLY during talk) 1 VFS/vnode architecture • VFS: Virtual File System: common object-oriented interface to fs's • vnode: virtual node: abstract file object, includes vnode ops • All operations to fs's and files done through VFS/vnode in- terface • S.R. Kleiman, \Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Summer USENIX 1986 2 Some Definitions superblock: fs summary, pointers to other information inode: on-disk structure containing information about a file indirect block: block containing pointers to other blocks metadata: everything that is not user data, including directory entries 3 Disk characteristics • Track - contiguous region, can be read at maximum speed • Seek time - time to move the head between different tracks • Rotational delay - time for part of track to move under head • Fixed per I/O overhead means bigger I/Os are better 4 In the beginning: System V FS (S5FS) (c. 1974) • First UNIX file system, referred to as \FS" • Disk layout: superblock, inodes, followed by everything else • 512-1024 byte block size, no fragments • Super simple - and super slow! 2-5% of raw disk bandwidth 5 Berkeley Fast File System (FFS or UFS) (c. 1984) • Metadata spread throughout the disk in \cylinder groups" • Block size 4KB minimum, frag size 1KB (to avoid 45% wasted space) • Physical
    [Show full text]
  • Set Hadoop-Specific Environment Variables Here
    # Set Hadoop-specific environment variables here. # The only required environment variable is JAVA_HOME. All others are # optional. When running a distributed configuration it is best to # set JAVA_HOME in this file, so that it is correctly defined on # remote nodes. # The java implementation to use. Required. export JAVA_HOME=/usr/jdk64/jdk1.8.0_112 export HADOOP_HOME_WARN_SUPPRESS=1 # Hadoop home directory export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/2.6.5.0-292/hadoop} # Hadoop Configuration Directory # Path to jsvc required by secure HDP 2.0 datanode export JSVC_HOME=/usr/lib/bigtop-utils # The maximum amount of heap to use, in MB. Default is 1000. export HADOOP_HEAPSIZE="1024" export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms1024m" # Extra Java runtime options. Empty by default. export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}" USER="$(whoami)" # Command specific options appended to HADOOP_OPTS when specified HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC - XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m - Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps - XX:+PrintGCDateStamps -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT - Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}" HADOOP_TASKTRACKER_OPTS="-server -Xmx1024m -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}" SHARED_HADOOP_NAMENODE_OPTS="-server
    [Show full text]
  • Filesystem Considerations for Embedded Devices ELC2015 03/25/15
    Filesystem considerations for embedded devices ELC2015 03/25/15 Tristan Lelong Senior embedded software engineer Filesystem considerations ABSTRACT The goal of this presentation is to answer a question asked by several customers: which filesystem should you use within your embedded design’s eMMC/SDCard? These storage devices use a standard block interface, compatible with traditional filesystems, but constraints are not those of desktop PC environments. EXT2/3/4, BTRFS, F2FS are the first of many solutions which come to mind, but how do they all compare? Typical queries include performance, longevity, tools availability, support, and power loss robustness. This presentation will not dive into implementation details but will instead summarize provided answers with the help of various figures and meaningful test results. 2 TABLE OF CONTENTS 1. Introduction 2. Block devices 3. Available filesystems 4. Performances 5. Tools 6. Reliability 7. Conclusion Filesystem considerations ABOUT THE AUTHOR • Tristan Lelong • Embedded software engineer @ Adeneo Embedded • French, living in the Pacific northwest • Embedded software, free software, and Linux kernel enthusiast. 4 Introduction Filesystem considerations Introduction INTRODUCTION More and more embedded designs rely on smart memory chips rather than bare NAND or NOR. This presentation will start by describing: • Some context to help understand the differences between NAND and MMC • Some typical requirements found in embedded devices designs • Potential filesystems to use on MMC devices 6 Filesystem considerations Introduction INTRODUCTION Focus will then move to block filesystems. How they are supported, what feature do they advertise. To help understand how they compare, we will present some benchmarks and comparisons regarding: • Tools • Reliability • Performances 7 Block devices Filesystem considerations Block devices MMC, EMMC, SD CARD Vocabulary: • MMC: MultiMediaCard is a memory card unveiled in 1997 by SanDisk and Siemens based on NAND flash memory.
    [Show full text]
  • Linux System Administration
    Linux System Administration Jonathan Quick Hartebeesthoek Radio Astronomy Observatory Goals • Help you to understand how Linux starts up, keeps running, and shuts down • Give confidence in dealing with hardware and software failures • Give an overview of what you can configure and how • Show you where to find more information when you need it • For the field system and Mark5’s 2 Basic Linux Concepts • Linux Kernel – Base monolithic kernel + loadable modules – Gives standardized access to underlying hardware • Linux System / "Distribution" – Kernel + lots of software – Adds both system and application level software to the system • Background processes ("daemons") 3 System Modifications • In order to do any system-wide changes you usually have to be logged in as 'root‘ – Or have root privileges • There are a number of approaches for this – Log in as user “root” – Execute “su –” from the present user account – Execute the command directly with “sudo” • E.g. “sudo tail /var/log/kern.log” 4 Logging in as 'root' • You can change to a virtual console (Ctrl-Alt- F1) and login normally or use 'su -' • 'root' can override all permissions, start and stop anything, erase hard drives,... – So please be careful with disk names and similar! – You can browse and check many (if not most of the) things as a normal user (like 'oper'). 5 Sudo • Sudo is a program designed to allow a sysadmin to give limited root privileges to users and log root activity. • The basic philosophy is to give as few privileges as possible but still allow people to get their work
    [Show full text]
  • Altavault Cloud Integrated Storage Command-Line Reference Guide
    NetApp® AltaVault® Cloud Integrated Storage 4.2.1 Command-Line Reference Guide NetApp, Inc. Telephone: +1 (408) 822-6000 Part number: 215-11346_A0 495 East Java Drive Fax: + 1 (408) 822-4501 August 2016 Sunnyvale, CA 94089 Support telephone: +1(888) 463-8277 U.S. Web: www.netapp.com Feedback: [email protected] Contents Beta Draft Contents Chapter 1 - Using the Command-Line Interface ......................................................................................3 Connecting to the CLI ......................................................................................................................................3 Overview of the CLI.........................................................................................................................................4 Entering Commands .........................................................................................................................................5 Accessing CLI Online Help..............................................................................................................................5 Error Messages .................................................................................................................................................5 Command Negation..........................................................................................................................................5 Running the Configuration Wizard...................................................................................................................6
    [Show full text]
  • Linux System Administration
    Linux System Administration Jonathan Quick, Hartebeesthoek Radio Astronomy Observatory Ari Mujunen, Metsähovi Radio Observatory Linux Startup and Shutdown Managing Hard Disks, Partitions, Backups Rescuing a Failing PC / System Modifying Configuration Adding/Removing Packages 1 Goals Help you to understand how Linux starts up, keeps running, and shuts down Give confidence in dealing with hardware and software failures Give an overview of what you can configure and how Show you where to find more information when you need it 2 Basic Linux Concepts Linux Kernel Base monolithic kernel + loadable modules Gives standardized access to underlying hardware Linux System / "Distribution" Kernel + lots of software Adds both system and application level software to the system Background processes ("daemons") 3 Logging in as 'root' In order to do any system-wide changes you usually have to be logged in as 'root' You can change to a virtual console (Ctrl-Alt- F1) and login normally or use 'su -' 'root' can override all permissions, start and stop anything, erase hard drives,... So please be careful with disk names and similar! You can browse and check many (if not most of the) things as a normal user (like 'oper'). 4 Getting System Information ps axf, top; kill, kill -9 free df, mount netstat -an, ifconfig, route -n w, who cat /proc/cpuinfo (and others) 5 Linux PC-Level Startup PC ROM BIOS initializes hardware and boots a Master Boot Record (MBR) From a floppy, hard disk, CD-ROM, ... That MBR contains LILO, the Linux Loader
    [Show full text]
  • File System and Scheduling Utilities (FSSU)
    Preliminary Specification Systems Management: File System and Scheduling Utilities (FSSU) RELIM P I N A R Y [This page intentionally left blank] X/Open Preliminary Specification Systems Management: File System and Scheduling Utilities (FSSU) X/Open Company Ltd. October 1995, X/Open Company Limited All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owners. X/Open Preliminary Specification Systems Management: File System and Scheduling Utilities (FSSU) ISBN: 1-85912-145-4 X/Open Document Number: P521 Published by X/Open Company Ltd., U.K. Any comments relating to the material contained in this document may be submitted to X/Open at: X/Open Company Limited Apex Plaza Forbury Road Reading Berkshire, RG1 1AX United Kingdom or by Electronic Mail to: [email protected] ii X/Open Preliminary Specification Contents Chapter 1 Introduction............................................................................................... 1 1.1 Core Facilities .............................................................................................. 1 1.2 NFS Feature Group..................................................................................... 1 Chapter 2 Scheduling Management .................................................................. 3 cron...................................................................................................................
    [Show full text]
  • Maintaining a File System File System Integrity Utility: Fsck -P [Filesystem]
    Maintaining a File System File System Integrity utility: fsck -p [fileSystem] fsck (file system check) scans the specified file systems and checks them for consistency. The kind of consistency errors that can exist include: • A block is marked as free in the bitmap but is also referenced from an inode. • A block is marked as used in the bitmap but is never referenced from an inode. • More than one inode refers to the same block. • An invalid block number. • An inode's link count is incorrect. • A used inode is not referenced from any directory. 1 file system integrity fsck -p [fileSystem] If the -p option is used, fsck automatically corrects any errors that it finds. Without the -p option, it prompts the user for confirmation of any corrections that it suggests. If fsck finds a block that is used but is not associated with a named file, it connects it to a file whose name is equal to the block's inode number in the "/lost+found" directory. If no file systems are specified, fsck checks the standard file systems listed in "/etc/fstab." Linux has specialized fsck programs for different types of file systems. For example, when checking an ext2 or ext3 file system, fsck act as a front-end to e2fsck, which is the program that actually checks the file system. 2 Display disk statistics My disk is full, my files are not saved, why?!@#$ du -- display disk usage displays the number of kB that are allocated to each of the specified filenames. If a filename refers to a directory, its files are recursively described -h option displays more human-readable
    [Show full text]
  • A Ffsck: the Fast File System Checker
    A ffsck: The Fast File System Checker AO MA, University of Wisconsin, Madison; Backup Recovery Systems Division, EMC Corporation CHRIS DRAGGA, ANDREA C. ARPACI-DUSSEAU, and REMZI H. ARPACI-DUSSEAU, University of Wisconsin, Madison MARSHALL KIRK McKUSICK, McKusick.com Crash failures, hardware errors, and file system bugs can corrupt file systems and cause data loss, despite the presence of journals and similar preventive techniques. While consistency checkers such as fsck can detect this corruption and restore a damaged image to a usable state, they are generally created as an afterthought, to be run only at rare intervals. Thus, checkers operate slowly, causing significant downtime for large scale storage systems when they are needed. We address this dilemma by treating the checker as a key component of the overall file system (and not merely a peripheral add-on). To this end, we present a modified ext3 file system, rext3, to directly support the fast file system checker, ffsck. The rext3 file system colocates and self-identifies its metadata blocks, removing the need for costly seeks and tree traversals during checking. These modifications to the file system allow ffsck to scan and repair the file system at rates approaching the full sequential bandwidth of the underlying device. In addition, we demonstrate that rext3 performs competitively with ext3 in most cases and exceeds it in handling random reads and large writes. Finally, we apply our principles to FFS, the default FreeBSD file system, and its checker, doing so in a lightweight fashion that preserves the file-system layout while still providing some of the gains in performance from ffsck.
    [Show full text]
  • Freebsd Command Reference
    FreeBSD command reference Command structure Each line you type at the Unix shell consists of a command optionally followed by some arguments , e.g. ls -l /etc/passwd | | | cmd arg1 arg2 Almost all commands are just programs in the filesystem, e.g. "ls" is actually /bin/ls. A few are built- in to the shell. All commands and filenames are case-sensitive. Unless told otherwise, the command will run in the "foreground" - that is, you won't be returned to the shell prompt until it has finished. You can press Ctrl + C to terminate it. Colour code command [args...] Command which shows information command [args...] Command which modifies your current session or system settings, but changes will be lost when you exit your shell or reboot command [args...] Command which permanently affects the state of your system Getting out of trouble ^C (Ctrl-C) Terminate the current command ^U (Ctrl-U) Clear to start of line reset Reset terminal settings. If in xterm, try Ctrl+Middle mouse button stty sane and select "Do Full Reset" exit Exit from the shell logout ESC :q! ENTER Quit from vi without saving Finding documentation man cmd Show manual page for command "cmd". If a page with the same man 5 cmd name exists in multiple sections, you can give the section number, man -a cmd or -a to show pages from all sections. man -k str Search for string"str" in the manual index man hier Description of directory structure cd /usr/share/doc; ls Browse system documentation and examples. Note especially cd /usr/share/examples; ls /usr/share/doc/en/books/handbook/index.html cd /usr/local/share/doc; ls Browse package documentation and examples cd /usr/local/share/examples On the web: www.freebsd.org Includes handbook, searchable mailing list archives System status Alt-F1 ..
    [Show full text]
  • The Linux Command Line
    The Linux Command Line Second Internet Edition William E. Shotts, Jr. A LinuxCommand.org Book Copyright ©2008-2013, William E. Shotts, Jr. This work is licensed under the Creative Commons Attribution-Noncommercial-No De- rivative Works 3.0 United States License. To view a copy of this license, visit the link above or send a letter to Creative Commons, 171 Second Street, Suite 300, San Fran- cisco, California, 94105, USA. Linux® is the registered trademark of Linus Torvalds. All other trademarks belong to their respective owners. This book is part of the LinuxCommand.org project, a site for Linux education and advo- cacy devoted to helping users of legacy operating systems migrate into the future. You may contact the LinuxCommand.org project at http://linuxcommand.org. This book is also available in printed form, published by No Starch Press and may be purchased wherever fine books are sold. No Starch Press also offers this book in elec- tronic formats for most popular e-readers: http://nostarch.com/tlcl.htm Release History Version Date Description 13.07 July 6, 2013 Second Internet Edition. 09.12 December 14, 2009 First Internet Edition. 09.11 November 19, 2009 Fourth draft with almost all reviewer feedback incorporated and edited through chapter 37. 09.10 October 3, 2009 Third draft with revised table formatting, partial application of reviewers feedback and edited through chapter 18. 09.08 August 12, 2009 Second draft incorporating the first editing pass. 09.07 July 18, 2009 Completed first draft. Table of Contents Introduction....................................................................................................xvi
    [Show full text]
  • Chunkfs: Using Divide–And–Conquer to Improve File System Reliability And
    Chunkfs: Using divide–and–conquer to improve file system reliability and repair Val Henson Arjan van de Ven Open Source Technology Center Open Source Technology Center Intel Corporation Intel Corporation val [email protected] [email protected] Amit Gud Zach Brown Kansas State University Oracle, Inc. [email protected] [email protected] Abstract First, disk I/O bandwidth is not keeping up with capac- ity. Between 2006 and 2013, disk capacity is projected to The absolute time required to check and repair a file increase by 16 times, but disk bandwidth by only 5 times. system is increasing because disk capacities are growing As a result, it will take about 3 times longer to read an faster than disk bandwidth and seek time remains almost entire disk. It’s as if your milkshake got 3 times big- unchanged. At the same time, file system repair is be- ger, but the size of your straw stayed the same. Second, coming more common, because the per–bit error rate of seek time will stay almost flat over the next 7 years, im- disks is not dropping as fast as the number of bits per proving by a pitiful factor of only 1.2. The performance disk is growing, resulting in more errors per disk. With of workloads with any significant number of seeks (i.e., existing file systems, a single corrupted metadata block most workloads) will not scale with the capacity of the requires the entire file system to be unmounted, checked, disk. Third, the per–bit error rate is not improving as and repaired—a process that takes hours or days to com- fast as disk capacity is growing.
    [Show full text]