“Application - File System” Divide with Promises


Bridging the “Application - File System” divide with promises

Raja Bala
Computer Sciences Department
University of Wisconsin, Madison, WI

Abstract

File systems today implement a limited set of abstractions and semantics wherein applications don’t really have much of a say. The generality of these abstractions tends to curb application performance. In the global world we live in, it seems reasonable that applications are treated as first-class citizens by the file system layer.

In this project, we take a first step towards that goal by leveraging promises that applications make to the file system. The promises are then utilized to deliver a better-tuned and more application-oriented file system. A very simple promise, called unique-create, was implemented, wherein the application vows never to create a file with an existing name (in a directory), which is then used by the file system to speed up creation time. The application communicates this promise to the virtual file system (VFS) layer using its process state. VFS then skips the directory cache and real filesystem lookups, resulting in around 10% performance improvement in file creation time.

1 Introduction

The file system layer is a black box of limited abstractions in the eyes of applications. This is due to the limited number of system calls that hook into the file system and the belief that the underlying file system is the best judge when it comes to operations with files. Unfortunately, the latter isn’t true, since applications know more about their behavior and what they need or do not need from the file system. Currently, there is no real mechanism that allows applications to communicate this information to the file system and thus have some degree of control over the file system functionality.

For example, an application that never appends to any of the files it creates has no means of conveying this information to the file system. Most file systems inherently assume that it is good to preallocate extra blocks to a file, so that when it expands, the preallocated blocks can be used. This is a simple optimization based on the realization that a sequential read is a lot faster than one interspersed with random seeks. For the aforementioned application, this would translate to a bunch of empty blocks that won’t be used unless there is a shortage of blocks. Such fine-grained control is completely absent in the current file system implementation.

It is interesting to compare the hourglass design of the traditional file system in Linux/Unix to that of the networking stack. In the latter, the Internet Protocol (IP) layer forms the neck of the hourglass and enables communication between different link layer and transport layer protocols as long as they use IP datagrams. In storage systems, the POSIX file system API serves a similar purpose of allowing transparent communication between applications and storage systems. There is a difference, however, in that IP has optional fields that can be leveraged by the other layers to exchange information that wasn’t thought of during the design. This has actually been utilized by transport protocols to provide hints to lower layers in wireless environments [5][6]. The virtual file system design, on the other hand, does not have any extensible mechanism for communication between the application layer and the low-level file systems.

In this paper, the notion of a promise is introduced as the extensible vehicle of communication between these layers. The local knowledge that an application possesses is called a promise, and it is passed on to the kernel as part of the process state. The file system implementation has been modified to check for one particular promise, unique-create, and utilize the presence of this promise to speed up file creation time. Unique-create essentially tells the file system that the application promises never to create a file with an existing name in that directory.

The rest of the paper is organized as follows. Section 2 discusses some of the related work in passing information from higher layers to the file system. Section 3 describes underlying assumptions in promises and gives a few examples of useful promises. Section 4 describes the unique-create promise in more detail, while Section 5 discusses the implementation of unique-create, the tests used to evaluate file creation with and without the promise and the results observed. Section 5 serves as the summary and conclusion of this work.

2 Related Work

Patterson et al. [1] utilize the disclosure of application knowledge of future accesses to enable informed prefetching and caching. Disclosing hints are issued via I/O control (ioctl) calls. Their work focused primarily on read requests and utilizes the application’s knowledge for proactive resource management and control over caching policies. They found that prefetching and caching based on the application-level information reduced file access read latency in both local and network file systems. The hints may or may not be used by the file system and hence are more like guidelines. This differs from promises, which are always checked by the file system.

Steere [2] proposes dynamic sets as an operating system abstraction to address the problem of I/O latency by exposing the application’s non-determinism and future data needs to the system, which then exploits this new information to reduce latency by improving scheduling and ordering of the I/O accesses. Cao et al. [4] focus on applications with large data sets and allow them to express control over cache replacement. The application-controlled replacement decisions, combined with their LRU-SP (Least Recently Used with Swapping and Placeholders) kernel allocation policy, reduce the number of disk I/Os significantly.

These works indicate the importance of using the application’s knowledge to dictate some of the file system policies, but they’ve almost always seemed to focus on read-related issues. The application can leverage promises for both read- and write-related optimization. The unique-create promise discussed in Section 4 is an example of a write optimization.

3 Underlying assumptions

So far, it has been assumed that the promises are known beforehand. In reality, promises must be inferred from the applications, either by means of their very design and implementation, by using formal analysis over the code, or by observing the arguments in the file system calls. This paper does not attempt to answer the bootstrapping problem of identifying promises.

If the application doesn’t keep its promise, then all bets are off. The behavior in such a situation is a consequence of the implementation of the promise. It is the file system’s duty to check for the promises an application makes. This, in turn, means that only the file system implementation is modified to handle the promise, and thus no change is required in the application code, which makes promises a practical solution.

Some promises that could be of potential use are:

i) The application could tell the file system about the importance of a file by the number of times it reads or writes to it, to influence some of the caching policies. Most of the related work focused on similar issues.
ii) The application might want to disable journaling for writes to some files it doesn’t care about to speed up performance.
iii) In distributed file systems, a lot of the permission checks occur at the master, before being routed to the chunk servers. Thus, the use of low-level permission checks might not be that important and may be an unnecessary operation.
iv) In GFS, every chunk is limited to 64 megabytes. It might be interesting to see if this information could simplify the block allocation logic and speed up writes.

4 The Unique-Create Promise

To assess the viability of promises, a simple promise called “Unique-Create” was chosen. When an application makes this promise to the file system, it pledges that it will never create a file with an existing name in a directory’s namespace. This promise is limited to the particular case of the open system call wherein both the flags O_CREAT and O_EXCL are set. With these flags, open() returns an error if the file already exists and creates the file otherwise. Now, putting this in the context of the unique-create promise, it means that open() would never return an error because the file exists, since the application promises the non-existence of the file. To understand why this promise could be useful, it is important to know the filename lookup and create mechanism in Linux.

Figure 1 shows a high-level view of the Linux file system architecture. The user space contains the applications and the GNU C Library, which provides the interface for the file system calls open, read, write, close. The system call interface acts as a switch and funnels system calls from the user space to the appropriate end points in the kernel space. In Linux and Unix-based operating systems, the VFS is the primary interface to the underlying file systems. It exports a common set of interfaces and abstracts them to the individual file systems (such as ext3, ext4, btrfs, etc.). There are two caches for the file system objects at the VFS level: the inode cache and the dentry (directory entry) cache.

When a file name lookup happens, VFS uses the file name and its parent directory and looks into the directory cache to check if the entry already exists.
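
The open semantics that the unique-create promise builds on can be sketched from user space as follows. This is only an illustration of the standard O_CREAT | O_EXCL behavior, not the kernel-side implementation of the promise described in the paper; the function names create_exclusive and second_create_is_rejected are hypothetical and chosen for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to create `path` exclusively.
 * Returns 0 if the file was created, or the errno value
 * (for example EEXIST) if open() failed.
 */
int create_exclusive(const char *path)
{
    /* With O_CREAT | O_EXCL, open() fails with EEXIST if the name exists. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/*
 * Hypothetical helper: create the same name twice and report both outcomes.
 * Returns 1 if the first call created the file and the second was rejected
 * with EEXIST, and 0 otherwise.
 */
int second_create_is_rejected(const char *path)
{
    int first = create_exclusive(path);
    int second = create_exclusive(path);
    if (first == 0)
        unlink(path); /* remove the file this helper created */
    printf("first: %d, second: %d\n", first, second);
    return first == 0 && second == EEXIST;
}
```

An application that honors unique-create would only ever take the first path: every name it passes is new, so the existence check that produces EEXIST never fires for it. That check is the work the promise allows the file system to skip.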