Organizing, Indexing, and Searching Large-Scale File Systems

Total Page:16

File Type:pdf, Size:1020Kb

Organizing, Indexing, and Searching Large-Scale File Systems Organizing, Indexing, and Searching Large-Scale File Systems Technical Report UCSC-SSRC-09-09 December 2009 Andrew W. Leung [email protected] Storage Systems Research Center Baskin School of Engineering University of California, Santa Cruz Santa Cruz, CA 95064 http://www.ssrc.ucsc.edu/ UNIVERSITY OF CALIFORNIA SANTA CRUZ ORGANIZING, INDEXING, AND SEARCHING LARGE-SCALE FILE SYSTEMS A dissertation submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in COMPUTER SCIENCE by Andrew W. Leung December 2009 The Dissertation of Andrew W. Leung is approved: Ethan L. Miller, Chair Darrell D.E. Long Garth R. Goodson Tyrus Miller Vice Provost and Dean of Graduate Studies Copyright c by Andrew W. Leung 2009 Table of Contents List of Figures vi List of Tables xiii Abstract xvi Dedication xviii Acknowledgments xix 1 Introduction 1 1.1 MainContributions ............................... 3 1.2 Analysis of Large-Scale File System Properties . .......... 4 1.3 New Approaches to File Indexing . 5 1.4 Towards Searchable File Systems . 6 1.5 Organization.................................... 7 2 Background and Related Work 8 2.1 TheGrowingDataVolumeTrend. 8 2.2 ScalableStorageSystems. 9 2.2.1 ParallelFileSystems . 10 2.2.2 Peer-to-Peer File Systems . 12 2.2.3 OtherFileSystems ............................ 13 2.3 The Data Management Problem in Large-Scale File Systems .......... 15 2.3.1 Search-Based File Management . 18 2.3.2 File System Search Background . 18 2.3.3 Large-Scale File System Search Use Cases . 20 2.4 Large-Scale File System Search Challenges . 23 2.5 ExistingSolutions............................... 26 2.5.1 BruteForceSearch ............................ 26 2.5.2 Desktop and Enterprise Search . 27 2.5.3 SemanticFileSystems ..... ..... ...... ...... .... 28 2.5.4 SearchableFileSystems . 31 iii 2.5.5 RankingSearchResults . 31 2.6 IndexDesign ................................... 32 2.6.1 Relational Databases . 33 2.6.2 InvertedIndexes ............................. 36 2.6.3 File System Optimized Indexing . 40 2.7 Summary ..................................... 41 3 Properties of Large-Scale File Systems 42 3.1 Previous Workload Studies . 45 3.1.1 TheNeedforaNewStudy . 46 3.1.2 TheCIFSProtocol ............................ 47 3.2 Workload Tracing Methodology . 47 3.3 WorkloadAnalysis ................................ 49 3.3.1 Terminology ............................... 49 3.3.2 Workload Comparison . 50 3.3.3 Comparison to Past Studies . 52 3.3.4 FileI/OProperties ............................ 61 3.3.5 FileRe-Opens .............................. 63 3.3.6 Client Request Distribution . 65 3.3.7 FileSharing................................ 65 3.3.8 File Type and User Session Patterns . 69 3.4 DesignImplications..... ...... ..... ...... ...... .. 73 3.5 Previous Snapshot Studies . 74 3.6 Snapshot Tracing Methodology . 76 3.7 SnapshotAnalysis................................ 77 3.7.1 SpatialLocality.............................. 77 3.7.2 Frequency Distribution . 78 3.8 Summary ..................................... 81 4 New Approaches to File Indexing 82 4.1 MetadataIndexing ................................ 84 4.1.1 SpyglassDesign ............................. 85 4.1.2 Hierarchical Partitioning . 86 4.1.3 Partition Versioning . 92 4.1.4 Collecting Metadata Changes . 95 4.1.5 DistributedDesign ............................ 96 4.2 ContentIndexing ................................. 97 4.2.1 IndexDesign ............................... 97 4.2.2 QueryExecution ............................. 100 4.2.3 IndexUpdates .............................. 101 4.2.4 Additional Functionality . 102 4.2.5 Distributing the Index . 103 4.3 Experimental Evaluation . 104 iv 4.3.1 Implementation Details . 105 4.3.2 ExperimentalSetup. 106 4.3.3 Metadata Collection Performance . 107 4.3.4 UpdatePerformance . 108 4.3.5 SpaceOverhead ............................. 109 4.3.6 SelectivityImpact ............................ 110 4.3.7 SearchPerformance ...... ..... ...... ...... .... 111 4.3.8 IndexLocality .............................. 113 4.3.9 Versioning Overhead . 117 4.4 Summary ..................................... 118 5 Towards Searchable File Systems 119 5.1 Background .................................... 121 5.1.1 SearchApplications ...... ..... ...... ...... .... 121 5.1.2 Integrating Search into the File System . 123 5.2 Integrating Metadata Search . 124 5.2.1 MetadataClustering ...... ..... ...... ...... .... 125 5.2.2 IndexingMetadata . ...... ..... ...... ...... .... 129 5.2.3 QueryExecution ............................. 132 5.2.4 Cluster-Based Journaling . 133 5.3 Integrating Semantic File System Functionality . ........... 134 5.3.1 CopernicusDesign . ...... ..... ...... ...... .... 135 5.3.2 GraphConstruction. 137 5.3.3 ClusterIndexing ............................. 139 5.3.4 QueryExecution ............................. 140 5.3.5 ManagingUpdates . ...... ..... ...... ...... .... 141 5.4 ExperimentalResults . 142 5.4.1 Implementation Details . 142 5.4.2 Microbenchmarks . ...... ..... ...... ...... .... 144 5.4.3 Macrobenchmarks . ...... ..... ...... ...... .... 150 5.5 Summary ..................................... 155 6 Future Directions 157 6.1 Large-Scale File Systems Properties . 157 6.2 New Approaches to File Indexing . 158 6.3 Towards Searchable File Systems . 159 7 Conclusions 161 Bibliography 165 v List of Figures 2.1 Network-attached storage architecture. Multiple clients utilize a centralized server for access to shared storage. 9 2.2 Parallel file system architecture. Clients send metadata requests to metadata servers (MDSs) and data requests to separate data stores. .......... 11 2.3 Inverted index architecture. The dictionary is a data structure that maps key- words to posting lists. Posting lists contain postings that describe each occur- rence of the keyword in the file system. Searches require retrieving the posting lists for each keyword in the query. 37 3.1 Request distribution over time. The frequency of all requests, read requests, and write requests are plotted over time. Figure 3.1(a) shows how the request distribution changes for a single week in October 2007. Here request totals are grouped in one hour intervals. The peak one hour request total for corporate is 1.7 million and 2.1 million for engineering. Figure 3.1(b) shows the request dis- tribution for a nine week period between September and November 2007. Here request totals are grouped into one day intervals. The peak one day intervals are 9.4 million for corporate and 19.1 million for engineering. ........... 53 3.2 Sequential run properties. Sequential access patterns are analyzed for various sequential run lengths. The X-axes are given in a log-scale. Figure 3.2(a) shows the length of sequential runs, while Figure 3.2(b) shows how many bytes are transferred in sequential runs. 56 vi 3.3 Sequentiality of data transfer. The frequency of sequentiality metrics is plot- ted against different data transfer sizes. Darker regions indicate a higher frac- tion of total transfers. Lighter regions indicate a lower fraction. Transfer types are broken into read-only, write-only, and read-write transfers. Sequentiality metrics are grouped by tenths for clarity. 57 3.4 File size access patterns. The distribution of open requests and bytes trans- ferred are analyzed according to file size at open. The X-axes are shown on a log-scale. Figure 3.4(a) shows the size of files most frequently opened. Fig- ure 3.4(b) shows the size of files from which most bytes are transferred. 59 3.5 File open durations. The duration of file opens is analyzed. The X-axis is pre- sented on a log-scale. Most files are opened very briefly, although engineering files are opened slightly longer than corporate files. ......... 60 3.6 File lifetimes. The distributions of lifetimes for all created and deleted files are shown. Time is shown on the X-axis on a log-scale. Files may be deleted through explicit delete request or truncation. ......... 61 3.7 File I/O properties. The burstiness and size properties of I/O requests are shown. Figure 3.7(a) shows the I/O inter-arrival times. The X-axis is presented on a log-scale. Figure 3.7(b) shows the sizes of read and write I/O. The X-axis is divided into 8 KB increments. 62 3.8 File open properties. The frequency of and duration between file re-opens is shown. Figure 3.8(a) shows how often files are opened more than once. Fig- ure 3.8(b) shows the time between re-opens and time intervals on the X-axis are giveninalog-scale................................. 64 3.9 Client activity distribution The fraction of clients responsible for certain ac- tivitiesisplotted. ................................. 65 3.10 File sharing properties. We analyze the frequency and temporal properties of file sharing. Figure 3.10(a) shows the distribution of files opened by multiple clients. Figure 3.10(b) shows the duration between shared opens. The durations on the X-axis are in a log-scale. 66 vii 3.11 File sharing access patterns. The fraction of read-only, write-only, and read- write accesses are shown for differing numbers of sharing clients. Gaps are seen where no files were shared with that number of clients. 67 3.12 User file sharing equality. The equality of sharing is shown for differing num- bers of sharing clients. The Gini coefficient, which measures the level of equal- ity, is near 0 when sharing clients have about the same number of opens to a file. It is near 1 when clients unevenly share opens to a file. ......... 68 3.13 File type popularity. The histograms on the right show which file
Recommended publications
  • Online Layered File System (OLFS): a Layered and Versioned Filesystem and Performance Analysis
    Loyola University Chicago Loyola eCommons Computer Science: Faculty Publications and Other Works Faculty Publications 5-2010 Online Layered File System (OLFS): A Layered and Versioned Filesystem and Performance Analysis Joseph P. Kaylor Konstantin Läufer Loyola University Chicago, [email protected] George K. Thiruvathukal Loyola University Chicago, [email protected] Follow this and additional works at: https://ecommons.luc.edu/cs_facpubs Part of the Computer Sciences Commons Recommended Citation Joe Kaylor, Konstantin Läufer, and George K. Thiruvathukal, Online Layered File System (OLFS): A layered and versioned filesystem and performance analysi, In Proceedings of Electro/Information Technology 2010 (EIT 2010). This Conference Proceeding is brought to you for free and open access by the Faculty Publications at Loyola eCommons. It has been accepted for inclusion in Computer Science: Faculty Publications and Other Works by an authorized administrator of Loyola eCommons. For more information, please contact [email protected]. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. Copyright © 2010 Joseph P. Kaylor, Konstantin Läufer, and George K. Thiruvathukal 1 Online Layered File System (OLFS): A Layered and Versioned Filesystem and Performance Analysis Joe Kaylor, Konstantin Läufer, and George K. Thiruvathukal Loyola University Chicago Department of Computer Science Chicago, IL 60640 USA Abstract—We present a novel form of intra-volume directory implement user mode file system frameworks such as FUSE layering with hierarchical, inheritance-like namespace unifica- [16]. tion. While each layer of an OLFS volume constitutes a subvol- Namespace Unification: Unix supports the ability to ume that can be mounted separately in a fan-in configuration, the entire hierarchy is always accessible (online) and fully navigable mount external file systems from external resources or local through any mounted layer.
    [Show full text]
  • A Searchable-By-Content File System
    A Searchable-by-Content File System Srinath Sridhar, Jeffrey Stylos and Noam Zeilberger May 2, 2005 Abstract Indexed searching of desktop documents has recently become popular- ized by applications from Google, AOL, Yahoo!, MSN and others. How- ever, each of these is an application separate from the file system. In our project, we explore the performance tradeoffs and other issues encountered when implementing file indexing inside the file system. We developed a novel virtual-directory interface to allow application and user access to the index. In addition, we found that by deferring indexing slightly, we could coalesce file indexing tasks to reduce total work while still getting good performance. 1 Introduction Searching has always been one of the fundamental problems of computer science, and from their beginnings computer systems were designed to support and when possible automate these searches. This support ranges from the simple-minded (e.g., text editors that allow searching for keywords) to the extremely powerful (e.g., relational database query languages). But for the mundane task of per- forming searches on users’ files, available tools still leave much to be desired. For the most part, searches are limited to queries on a directory namespace—Unix shell wildcard patterns are useful for matching against small numbers of files, GNU locate for finding files in the directory hierarchy. Searches on file data are much more difficult. While from a logical point of view, grep combined with the Unix pipe mechanisms provides a nearly universal solution to content-based searches, realistically it can be used only for searching within small numbers of (small-sized) files, because of its inherent linear complexity.
    [Show full text]
  • A Semantic File System for Integrated Product Data Management', Advanced Engineering Informatics, Vol
    Citation for published version: Eck, O & Schaefer, D 2011, 'A Semantic File System for Integrated Product Data Management', Advanced Engineering Informatics, vol. 25, no. 2, pp. 177-184. https://doi.org/10.1016/j.aei.2010.08.005 DOI: 10.1016/j.aei.2010.08.005 Publication date: 2011 Document Version Publisher's PDF, also known as Version of record Link to publication University of Bath Alternative formats If you require this document in an alternative format, please contact: [email protected] General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Download date: 02. Oct. 2021 Advanced Engineering Informatics 25 (2011) 177–184 Contents lists available at ScienceDirect Advanced Engineering Informatics journal homepage: www.elsevier.com/locate/aei A semantic file system for integrated product data management ⇑ Oliver Eck a, , Dirk Schaefer b a Department of Computer Science, HTWG Konstanz, Germany b Systems Realization Laboratory, Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA, USA article info abstract Article history: A mechatronic system is a synergistic integration of mechanical, electrical, electronic and software tech- Received 31 October 2009 nologies into electromechanical systems. Unfortunately, mechanical, electrical, and software data are Received in revised form 10 June 2010 often handled in separate Product Data Management (PDM) systems with no automated sharing of data Accepted 17 August 2010 between them or links between their data.
    [Show full text]
  • Using Virtual Directories Prashanth Mohan, Raghuraman, Venkateswaran S and Dr
    1 Semantic File Retrieval in File Systems using Virtual Directories Prashanth Mohan, Raghuraman, Venkateswaran S and Dr. Arul Siromoney {prashmohan, raaghum2222, wenkat.s}@gmail.com and [email protected] Abstract— Hard Disk capacity is no longer a problem. How- ‘/home/user/docs/univ/project’ directory lists the ever, increasing disk capacity has brought with it a new problem, file. The ‘Database File System’ [2], encounters this prob- the problem of locating files. Retrieving a document from a lem by listing recursively all the files within each directory. myriad of files and directories is no easy task. Industry solutions are being created to address this short coming. Thereby a funneling of the files is done by moving through We propose to create an extendable UNIX based File System sub-directories. which will integrate searching as a basic function of the file The objective of our project (henceforth reffered to as system. The File System will provide Virtual Directories which SemFS) is to give the user the choice between a traditional list the results of a query. The contents of the Virtual Directory heirarchial mode of access and a query based mechanism is formed at runtime. Although, the Virtual Directory is used mainly to facilitate the searching of file, It can also be used by which will be presented using virtual directories [1], [5]. These plugins to interpret other queries. virtual directories do not exist on the disk as separate files but will be created in memory at runtime, as per the directory Index Terms— Semantic File System, Virtual Directory, Meta Data name.
    [Show full text]
  • Privacy Engineering for Social Networks
    UCAM-CL-TR-825 Technical Report ISSN 1476-2986 Number 825 Computer Laboratory Privacy engineering for social networks Jonathan Anderson December 2012 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2012 Jonathan Anderson This technical report is based on a dissertation submitted July 2012 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Trinity College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Privacy engineering for social networks Jonathan Anderson In this dissertation, I enumerate several privacy problems in online social net- works (OSNs) and describe a system called Footlights that addresses them. Foot- lights is a platform for distributed social applications that allows users to control the sharing of private information. It is designed to compete with the performance of today’s centralised OSNs, but it does not trust centralised infrastructure to en- force security properties. Based on several socio-technical scenarios, I extract concrete technical problems to be solved and show how the existing research literature does not solve them. Addressing these problems fully would fundamentally change users’ interactions with OSNs, providing real control over online sharing. I also demonstrate that today’s OSNs do not provide this control: both user data and the social graph are vulnerable to practical privacy attacks. Footlights’ storage substrate provides private, scalable, sharable storage using untrusted servers. Under realistic assumptions, the direct cost of operating this storage system is less than one US dollar per user-year.
    [Show full text]
  • Enabling Security in Cloud Storage Slas with Cloudproof
    Enabling Security in Cloud Storage SLAs with CloudProof Raluca Ada Popa Jacob R. Lorch David Molnar Helen J. Wang Li Zhuang MIT Microsoft Research [email protected] {lorch,dmolnar,helenw,lizhuang}@microsoft.com ABSTRACT ers [30]. Even worse, LinkUp (MediaMax), a cloud storage Several cloud storage systems exist today, but none of provider, went out of business after losing 45% of client data them provide security guarantees in their Service Level because of administrator error. Agreements (SLAs). This lack of security support has been Despite the promising experiences of early adopters and a major hurdle for the adoption of cloud services, especially benefits of cloud storage, issues of trust pose a significant for enterprises and cautious consumers. To fix this issue, we barrier to wider adoption. None of today's cloud stor- present CloudProof, a secure storage system specifically de- age services|Amazon's S3, Google's BigTable, HP, Mi- signed for the cloud. In CloudProof, customers can not only crosoft's Azure, Nirvanix CloudNAS, or others|provide se- detect violations of integrity, write-serializability, and fresh- curity guarantees in their Service Level Agreements (SLAs). ness, they can also prove the occurrence of these violations For example, S3's SLA [1] and Azure's SLA [10] only guar- to a third party. This proof-based system is critical to en- antee availability: if availability falls below 99:9%, clients abling security guarantees in SLAs, wherein clients pay for are reimbursed a contractual sum of money. As cloud stor- a desired level of security and are assured they will receive age moves towards a commodity business, security will be a certain compensation in the event of cloud misbehavior.
    [Show full text]
  • Modern Infrastructure for Dummies®, Dell EMC #Getmodern Special Edition
    Modern Infrastructure Dell EMC #GetModern Special Edition by Lawrence C. Miller These materials are © 2017 John Wiley & Sons, Ltd. Any dissemination, distribution, or unauthorized use is strictly prohibited. Modern Infrastructure For Dummies®, Dell EMC #GetModern Special Edition Published by: John Wiley & Sons Singapore Pte Ltd., 1 Fusionopolis Walk, #07-01 Solaris South Tower, Singapore 138628, www.wiley.com © 2017 by John Wiley & Sons Singapore Pte Ltd. Registered Office John Wiley & Sons Singapore Pte Ltd., 1 Fusionopolis Walk, #07-01 Solaris South Tower, Singapore 138628 All rights reserved No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted by the Copyright Act of Singapore 1987, without the prior written permission of the Publisher. For information about how to apply for permission to reuse the copyright material in this book, please see our website http://www.wiley.com/go/ permissions. Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. Dell EMC and the Dell EMC logo are registered trademarks of Dell EMC. All other trademarks are the property of their respective owners. John Wiley & Sons Singapore Pte Ltd., is not associated with any product or vendor mentioned in this book. LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
    [Show full text]
  • VERSIONFS a Versatile and User-Oriented Versioning File System
    VERSIONFS A Versatile and User-Oriented Versioning File System A THESIS PRESENTED BY KIRAN-KUMAR MUNISWAMY-REDDY TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE STONY BROOK UNIVERSITY Technical Report FSL-03-03 December 2003 Abstract of the Thesis Versionfs A Versatile and User-Oriented Versioning File System by Kiran-Kumar Muniswamy-Reddy Master of Science in Computer Science Stony Brook University 2003 File versioning is a useful technique for recording a history of changes. Applications of ver- sioning include backups and disaster recovery, as well as monitoring intruders’ activities. Alas, modern systems do not include an automatic and easy-to-use file versioning system. Existing backup systems are slow and inflexible for users. Even worse, they often lack backups for the most recent day’s activities. Online disk snapshotting systems offer more fine-grained versioning, but still do not record the most recent changes to files. Moreover, existing systems also do not give individual users the flexibility to control versioning policies. We designed a lightweight user-oriented versioning file system called Versionfs. Versionfs works with any file system, whether local or remote, and provides a host of user-configurable policies: versioning by users, groups, processes, or file names and extensions; version retention policies by maximum number of versions kept, age, or total space consumed by versions of a file; version storage policies using full copies, compressed copies, or deltas. Versionfs creates file versions automatically, transparently, and in a file-system portable manner—while maintaining Unix semantics. A set of user-level utilities allow administrators to configure and enforce default policies; users are able to set policies within configured boundaries, as well as view, control, and recover files and their versions.
    [Show full text]
  • Orion File System : File-Level Host-Based Virtualization
    Orion File System : File-level Host-based Virtualization Amruta Joshi Faraz Shaikh Sapna Todwal Pune Institute of Computer Pune Institute of Computer Pune Institute of Computer Technology, Technology, Technology, Dhankavadi, Pune 411043, India Dhankavadi, Pune 411043, India Dhankavadi, Pune 411043, India 020-2437-1101 020-2437-1101 020-2437-1101 [email protected] [email protected] [email protected] Abstract— The aim of Orion is to implement a solution that The automatic indexing of files and directories is called provides file-level host-based virtualization that provides for "semantic" because user programmable transducers use better aggregation of content/information based on information about these semantics of files to extract the semantics and properties. File-system organization today properties for indexing. The extracted properties are then very closely mirrors storage paradigms rather than user- stored in a relational database so that queries can be run access paradigms and semantic grouping. All file-system against them. Experimental results from our semantic file hierarchies are containers that are expressed based on their system implementation ORION show that semantic file physical presence (a separate drive letter on Windows, or a systems present a more effective storage abstraction than the particular mount point based on the volume in Unix). traditional tree structured file systems for information We have implemented a solution that will allow sharing, storage and retrieval. users to organize their files
    [Show full text]
  • Metadata Efficiency in Versioning File Systems
    2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, Mar 31 - Apr 2, 2003. Metadata Efficiency in Versioning File Systems Craig A.N. Soules, Garth R. Goodson, John D. Strunk, Gregory R. Ganger Carnegie Mellon University Abstract of time over which comprehensive versioning is possi- ble. To be effective for intrusion diagnosis and recovery, Versioning file systems retain earlier versions of modi- this duration must be greater than the intrusion detection fied files, allowing recovery from user mistakes or sys- latency (i.e., the time from an intrusion to when it is de- tem corruption. Unfortunately, conventional versioning tected). We refer to the desired duration as the detection systems do not efficiently record large numbers of ver- window. In practice, the duration is limited by the rate sions. In particular, versioned metadata can consume of data change and the space efficiency of the version- as much space as versioned data. This paper examines ing system. The rate of data change is an inherent aspect two space-efficient metadata structures for versioning of a given environment, and an analysis of several real file systems and describes their integration into the Com- environments suggests that detection windows of several prehensive Versioning File System (CVFS), which keeps weeks or more can be achieved with only a 20% cost in all versions of all files. Journal-based metadata encodes storage capacity [41]. each metadata version into a single journal entry; CVFS uses this structure for inodes and indirect blocks, reduc- In a previous paper [41], we described a prototype self- ing the associated space requirements by 80%.
    [Show full text]
  • The Sile Model — a Semantic File System Infrastructure for the Desktop
    The Sile Model — A Semantic File System Infrastructure for the Desktop Bernhard Schandl and Bernhard Haslhofer University of Vienna, Department of Distributed and Multimedia Systems {bernhard.schandl,bernhard.haslhofer}@univie.ac.at Abstract. With the increasing storage capacity of personal computing devices, the problems of information overload and information fragmen- tation become apparent on users’ desktops. For the Web, semantic tech- nologies aim at solving this problem by adding a machine-interpretable information layer on top of existing resources, and it has been shown that the application of these technologies to desktop environments is helpful for end users. Certain characteristics of the Semantic Web archi- tecture that are commonly accepted in the Web context, however, are not desirable for desktops; e.g., incomplete information, broken links, or disruption of content and annotations. To overcome these limitations, we propose the sile model, an intermediate data model that combines attributes of the Semantic Web and file systems. This model is intended to be the conceptual foundation of the Semantic Desktop, and to serve as underlying infrastructure on which applications and further services, e.g., virtual file systems, can be built. In this paper, we present the sile model, discuss Semantic Web vocabularies that can be used in the context of this model to annotate desktop data, and analyze the performance of typical operations on a virtual file system implementation that is based on this model. 1 Introduction A large amount of information is stored on personal desktops. We use our per- sonal computing devices—both mobile and stationary—to communicate, to write documents, to organize multimedia content, to search for and retrieve informa- tion, and much more.
    [Show full text]
  • Replication, History, and Grafting in the Ori File System
    Replication, History, and Grafting in the Ori File System Ali Jose´ Mashtizadeh, Andrea Bittau, Yifeng Frank Huang, David Mazieres` Stanford University Abstract backup, versioning, access from any device, and multi- user sharing—over storage capacity. But is the cloud re- Ori is a file system that manages user data in a modern ally the best place to implement data management fea- setting where users have multiple devices and wish to tures, or could the file system itself directly implement access files everywhere, synchronize data, recover from them better? disk failure, access old versions, and share data. The In terms of technological changes, disk space has in- key to satisfying these needs is keeping and replicating creased dramatically and has outgrown the increase in file system history across devices, which is now prac- wide-area bandwidth. In 1990, a typical desktop machine tical as storage space has outpaced both wide-area net- had a 60 MB hard disk, whose entire contents could tran- work (WAN) bandwidth and the size of managed data. sit a 9,600 baud modem in under 14 hours [2]. Today, Replication provides access to files from multiple de- $120 can buy a 3 TB disk, which requires 278 days to vices. History provides synchronization and offline ac- transfer over a 1 Mbps broadband connection! Clearly, cess. Replication and history together subsume backup cloud-based storage solutions have a tough time keep- by providing snapshots and avoiding any single point of ing up with disk capacity. But capacity is also outpac- failure. In fact, Ori is fully peer-to-peer, offering oppor- ing the size of managed data—i.e., the documents on tunistic synchronization between user devices in close which users actually work (as opposed to large media proximity and ensuring that the file system is usable so files or virtual machine images that would not be stored long as a single replica remains.
    [Show full text]