Insight: a Semantic File System 1.1 Motivation

: A Semantic File System Final Report June 18th, 2008 David Ingram Department of Computing Imperial College London [email protected] Supervisor: Dr Peter McBrien, [email protected] Second Marker: Dr Chris Hogger, [email protected] Abstract For over 30 years, hierarchical file systems have enforced a limiting mode of thinking upon users, requiring them to organise their files into specific paths. Despite this, humans naturally tend to associate objects in the physical world with a set of loosely-defined attributes, rather than a well-defined position. Many users will say that they cannot locate or organise the files they have created, either because they can no longer remember the names they gave the files, or because they can only recall information about the subject of the files or data contained therein. Users therefore revert to searches, which may be time-consuming and frustrating, as they may not be able to search for the data they can recall. In order to provide a closer match to the way the mind organises data, file structure should not be solely based upon one unique hierarchical location but custom semantic attributes assigned to the data. These attributes may take the form of keywords (e.g. amusing), structured keywords (document.report), key–value pairs (filetype = ‘mp3 ’) or more complex structures. The aim of this project, therefore, is to build a proof-of-concept semantic file system for Linux, providing fast keyword- and key–value-based indexes that will allow users to find and structure their files as they require, without needing to resort to awkward searches. This system should be available to every pro- gram that deals with files, offering a consistent method of locating data that is backwards-compatible with the existing hierarchical system. Acknowledgements A project such as this can never be the work of just one person. Many people have provided input, advice and assistance, and without them it would not have been possible. I would therefore like to thank the following people for their involvement with this project: Dr. Peter McBrien, for kindly agreeing to supervise my project proposal and particularly for his assistance in the late stages of the project. Dr. Chris Hogger, for his suggestions and enthusiasm. And to those who have contributed in other ways: Nick Pope, for keeping my visions realistic and for some stylistic assistance with LATEX. Katie Stevens, for not begrudging me the time spent on this project, for her superior mastery of the English language, and for providing the viewpoint of a normal user. David Durant, for his advice, suggestions, and not least his criticisms that helped shape my reports. My housemates, for their advice, suggestions, and cooking skills. This project comes at the end of four years’ study at Imperial, and it would be impossible to mention by name everybody who has shared it with me. I would therefore like to thank my friends and family for their support, and for sharing both the good and bad experiences throughout my degree. I could not have done this without you. Contents 1 Introduction 1 1.1 Motivation ...................................... 2 1.2 Aims............................................ 3 1.3 Tradeoffs ......................................... 3 1.4 Solution.......................................... 4 1.5 Noteonterminology ................................. 4 2 History 5 2.1 Earlyfilesystems................................... 5 2.2 Latertechnology ..................................... 5 2.3 Recent file systems . 6 2.4 Thefuture......................................... 7 3 Current State of the Art 9 3.1 Documentstores ..................................... 9 3.1.1 MicrosoftWinFS................................. 9 3.1.2 GNOMEStorage................................. 10 3.1.3 BFS ........................................ 10 3.1.4 Tagsistant.................................... 10 3.1.5 MWFS ...................................... 11 3.2 Metadata indexes . 11 3.2.1 DBFS ....................................... 11 3.2.2 AppleSpotlight.................................. 11 3.2.3 GoogleDesktop.................................. 12 3.2.4 Beagle....................................... 12 3.3 Semantic file system papers . 12 3.3.1 SFS ........................................ 12 3.3.2 pStore....................................... 13 3.3.3 Automated Attribute Assignment . 14 3.4 Other solutions . 14 3.5 Summary ......................................... 14 i 4 Design 17 4.1 Files and directories . 17 4.2 Pathsyntax...................................... 17 4.3 Datastorage ..................................... 19 4.4 B+tree .......................................... 20 4.4.1 Datastoragerequirements. 21 4.5 Query engine . 21 4.6 Limitations ...................................... 22 5 Implementation 25 5.1 B+treeprototype ................................... 25 5.1.1 On-diskformat.................................. 25 5.1.2 Multipletrees................................... 27 5.2 Querytrees ........................................ 27 5.2.1 Query tree nodes . 27 5.2.2 Queryextraction ................................. 27 5.3 FUSEimplementation .................................. 31 5.3.1 IntroductiontoFUSE .............................. 31 5.3.2 Inodeassignment................................. 31 5.3.3 Filesystemoperations ............................. 32 5.3.4 Directory listing generation . 32 6 Evaluation 35 6.1 Meetingtheaims..................................... 36 6.1.1 Primaryaims................................... 36 6.1.2 Secondaryaims.................................. 36 6.2 Limitations of this solution . ..... 37 7 Conclusion 39 7.1 Project conclusion . 39 7.2 Futurework........................................ 39 7.2.1 Extendingthesyntax .............................. 39 7.2.2 GUIintegration................................. 40 7.2.3 Extended attribute integration . 40 7.2.4 Additional query operators . 40 7.2.5 Dynamic query expressions . 40 7.2.6 File system visualisation tools . ...... 40 7.2.7 Datatypes .................................... 41 7.2.8 Directoryschemas ................................ 41 ii List of Figures 2.1 Flatfilesystem..................................... 5 2.2 Hierarchical file system . 6 4.1 Two views of the path /uni/year4/419/cw.tex ................... 18 4.2 Pathsyntax...................................... 19 4.3 Set visualisation of the path /uni/year/:/4/course:/419/ ............. 19 4.4 Path canonicalisation and tag extraction . ..... 20 4.5 Sample‘tags’tablelayout ............................. 20 4.6 Overview of a generic Insight B+tree ......................... 21 4.7 Two possible directory listings for /insight/TV/:/episode/ ............ 22 4.8 A sample (partial) directory tree . 23 5.1 On-diskformat ...................................... 26 ` 5.2 Query tree construction for /uni/year`4/course 419 ................ 29 5.3 Examples of query trees built from paths . 30 5.4 The path of a file system request via FUSE . 31 5.5 Examples of directory listings . 34 iii iv List of Tables 5.1 Values of constants assuming block size of 512 . 26 5.2 Query node types . 27 5.3 FUSE file system operations . 33 v vi Chapter 1 Introduction Therefore, since brevity is the soul of wit, And tediousness the limbs and outward flourishes, I will be brief Hamlet Act 2, scene 2 File systems are an integral part of day-to-day computer use that are often taken for granted and therefore often overlooked. They have remained largely unchanged semantically for thirty years, and the main innovations have been related to clustering, network file systems, and reliability. There have been no real changes to the way in which users interact with the system. Organisation is a difficult problem which users face every time they deal with files. Most users have evolved personal systems of categorising their files. However good these may be, they cannot be perfect. Just as electronic files are intended to be a superior analogue of paper files (and very often are), electronic folders should be more powerful and more flexible versions of their physical counterparts. In order to improve upon the concept of folder-based storage in the majority of existing file systems, the major contributions of this project are: Design of a semantic file system (Chapter 4), including a syntax for legacy application access (Sections 4.1 & 4.2) and query engine (Section 4.5) Implementation of a proof-of-concept semantic file system for Linux (Chapter 5) The remainder of this chapter deals with the project as a whole, while Chapters 2 and 3 cover background material referred to elsewhere in this report. As this project has very broad scope, a number of suggestions for future work have also been included (Section 7.2) 1 Chapter 1: Introduction Insight: A Semantic File System 1.1 Motivation The main motivation behind this project is frustration with the current methods for organising files based on arbitrarily-chosen classification. The key problem: Files can only (generally) exist in one location in a file system hierarchy, but there may be multiple appropriate categorisations. Deciding how to organise one’s files can be difficult. Files can only be placed in one folder, which can be considered a category. Unless those categories are well-chosen, it may be difficult to locate files easily. Unfortunately, there are a large number of possible categorisations for files. This can be illustrated with the following examples, each listed with potential difficulties: Organising files by file extension – Different folders for .txt, .doc, and .pdf

Insight: a Semantic File System 1.1 Motivation

File Protection – Using Rsync Whitepaper

Copy on Write Based File Systems Performance Analysis and Implementation

Cyber502x Computer Forensics

Partitioner Och Filsystem 2

Refs: Is It a Game Changer? Presented By: Rick Vanover, Director, Technical Product Marketing & Evangelism, Veeam

Jim Allchin on Longhorn, Winfs, 64-Bit and Beyond Page 34 Jim

Filesystems HOWTO Filesystems HOWTO Table of Contents Filesystems HOWTO

A Searchable-By-Content File System

A Semantic File System for Integrated Product Data Management', Advanced Engineering Informatics, Vol

DB/C Newsletter February 2006

IT Acronyms.Docx

Using Hierarchical Folders and Tags for File Management