Design and Implementation of a Metadata-Rich File System

Sasha Ames*†, Maya B. Gokhale†, and Carlos Maltzahn*
*University of California, Santa Cruz    †Lawrence Livermore National Laboratory

Abstract

Despite continual improvements in the performance and reliability of large scale file systems, the management of user-defined file system metadata has changed little in the past decade. The mismatch between the size and complexity of large scale data stores and their ability to organize and query their metadata has led to a de facto standard in which raw data is stored in traditional file systems, while related, application-specific metadata is stored in relational databases. This separation of data and semantic metadata requires considerable effort to maintain consistency and can result in complex, slow, and inflexible system operation. To address these problems, we have developed the Quasar File System (QFS), a metadata-rich file system in which files, user-defined attributes, and file relationships are all first class objects. In contrast to hierarchical file systems and relational databases, QFS defines a graph data model composed of files and their relationships. QFS incorporates Quasar, an XPath-extended query language for searching the file system. Results from our QFS prototype show the effectiveness of this approach. Compared to the de facto standard, the QFS prototype shows superior ingest performance and comparable query performance on user metadata-intensive operations, and superior performance on normal file metadata operations.

Figure 1: The Traditional Architecture (left), to manage file data and user-defined metadata, places file data in conventional file systems and user-defined metadata in databases. In contrast, a metadata-rich file system (right) integrates storage, access, and search of structured metadata with unstructured file data.

1 Introduction

The annual creation rate of digital data, already 468 exabytes in 2008, is growing at a compound annual growth rate of 73%, with a projected 10-fold increase over the next five years [19, 18]. Sensor networks of growing size and resolution continue to produce ever larger data streams that form the basis for weather forecasting, climate change analysis and modeling, and homeland security. New digital content, such as video, music, and documents, also adds to the world's digital repositories. These data streams must be analyzed, annotated, and searched to be useful; however, currently used file system architectures do not meet these data management challenges. There are a variety of ad hoc schemes in existence today to attach user-defined metadata to files, such as a distinguished suffix, encoding metadata in the filename, putting metadata as comments in the file, or maintaining adjunct files related to primary data files. Application developers needing to store more complex inter-related metadata typically resort to the Traditional Architecture approach shown on the left in Figure 1, storing data in file systems as a series of files and managing annotations and other metadata in relational databases. An example of this approach is the Sloan Digital Sky Survey [39, 40], in which sky objects and related metadata are stored in a Microsoft SQL Server database and refer to the raw data stored in regular file systems by absolute pathname.

This approach likely emerged because of file systems' ability to store very large amounts of data, combined with databases' superiority to traditional file systems in their ability to query data. Each complemented the other's weakness: file systems do not support flexible queries to identify files according to their metadata properties, and few databases can efficiently support the huge volume of data that must be stored.

Unfortunately, this separation increases complexity and reduces performance and consistency in several ways. First, the metadata must be cast into a relational database form, even though metadata and data conform more closely to a graph model. Then, the application developer must design and build a relational database tailored to the application. As the application changes, the database schema might require modification, and all the metadata migrated to the new schema. Using the database to retrieve metadata involves a two-step process of evaluating a query and resolving a potentially large number of file names. Furthermore, the association between metadata and files via POSIX file names is brittle and can become inconsistent when files are moved. Finally, queries cannot easily be restricted to portions of the namespace.

The access profile of data stream ingest, annotation, and analysis-oriented querying does not require the stringent semantics and overhead of database transactions [17], making it feasible to integrate a lighter-weight index into the file system to facilitate the update and query needs of many applications.

To address these needs, we propose, implement and evaluate a metadata-rich, queryable file system architecture that maintains user-defined metadata as an intrinsic part of the file data, and simultaneously provides a sophisticated metadata query interface. Rich metadata extends POSIX file system metadata, such as standard names, access rights, file types, and timestamps, to include arbitrary user-defined data associated with a file, as well as linking relationships between files [2]. Although many existing file systems support storage of rich metadata in extended attributes, none efficiently support a graph data model with attributed relationship links or integrate queries against all of the extended attributes into file system naming.

The contributions of this paper are: (1) the design and prototype implementation of the QFS metadata-rich file system based on a graph data model, (2) the design and prototype implementation of the Quasar path-based file system query language specifically designed for the data model of files, links, and attributes, and (3) quantitative evaluation of QFS compared to the Traditional Architecture of hierarchical file system plus relational database.

2 A Metadata-Rich File System

We define a metadata-rich file system as one that augments conventional file system I/O services (such as the ones defined by POSIX) with an infrastructure to store and query user-defined file metadata and attributed links between files. Our goal in exploring metadata-rich file systems is to examine their potential for the analysis and management of scientific, sensor, and text data.

Under the Traditional Architecture metadata and data are kept in different systems (see Figure 1, left). The separation has disadvantages in terms of complexity, consistency and performance:

Brittle Schema—The application developer must design a schema specialized for the application. When new attribute or link types are inserted, the schema must be redefined, and the database must be migrated to the new schema, a prohibitively expensive operation.

Brittle metadata/data association—The association of metadata to files via POSIX file names is brittle. Large data streams require continual ingest of new data and de-staging of older data into archives. When files get de-staged, their filesystem-specific POSIX path names change. Updating the database requires extra maintenance of indices with considerable update and querying overhead.

Expensive path name evaluation—A query in the Traditional Architecture returns a list of file names that need to be retrieved from the file system. Thus retrieving data involves a two-step process of evaluating a query and resolving a potentially large number of file names.

Global scope—Files are stored hierarchically. Filesystem directories align to semantic meaning and access locality [27]. Yet, the Traditional Architecture does not allow restricting the scope of queries to a directory without extra indexing overhead that is aggravated by the continual stream of new data entering and older data leaving the filesystem.

In contrast (Figure 1, right), the metadata-rich file system integrates the management of and provides a single interface for metadata and data with a general and flexible graph-based schema. Association between data and metadata automatically remains consistent regardless of path name changes. For improved performance, such an integrated system can support combined data and metadata writes. It becomes possible to append additional metadata items to existing files without schema changes. Files are identified by file IDs, so that query results resolve directly to files, obviating the need to resolve file names. The query interface, based on the XPath standard, extends the POSIX file system interface with syntax to select files matching arbitrary metadata characteristics while allowing the query to limit the scope of such selections using path names.

2.1 Data Model

We represent rich metadata using file attributes (similar to extended attributes as defined in POSIX), directional links between files,¹ and attributes attached to links [3]. File attributes include traditional file system metadata (similar to the Inversion File System [32]). A link is a first-class file system object representing a directional edge from a parent file to a child file, as shown in Figure 2.

In Figure 2, each file (circle) has a set of attributes in the form of attribute name/value pairs. Files are connected by links, which can also have attributes attached to them. The example shows attribute/value pairs such as [filetype, NewsStory], [IsTabular, yes], [NodeType, SemanticTag], etc. Links can also have attribute/value pairs, such as [LinkType, HasEntity] or [Extractor, Stanford]. In the example, the attributes placed on links contain the provenance of the relationship. For instance, the depicted rightmost link was created by the Stanford Extractor, while the leftmost link was from the Unified Extractor.

Links are attached to files by object ID so that changing file path names will not break links as long as the file remains within the same object ID name space. A file cannot be deleted until all links from and to that file are deleted. Note that more than one link can connect a pair of files: there can be multiple links between two files as long as the links can be distinguished by at least one link attribute or by their direction.

In practice link attributes often include a name and a type attribute. For example, a file directory can be represented by a file pointing to other files with links of the type "child" and with "name" attributes identifying the relative path name of a file. Thus, our data model for metadata-rich file systems does not require extra directory objects. Whenever convenient we will refer to a file with children as a "directory". Links can also represent any kind of relationship that is not necessarily hierarchical, such as provenance, temporal locality, hyperlinks, and bibliographic citations. Files may or may not contain any content other than attributes and links.

This data model addresses brittle metadata/data associations in Traditional Architectures by storing metadata in the corresponding file system objects such that changes to file path names do not break metadata/data associations. It also provides a general schema for storing metadata—attributes and links—rather than requiring application developers to design customized relational schemas. The query language used to search within this graph data model addresses the other weaknesses of the Traditional Architecture, i.e., expensive path name evaluation and the global scope problem.

¹QFS directional links are not to be confused with hard or symbolic links of POSIX file systems.
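To make the data model concrete, the following sketch shows one way the objects described above could be laid out in memory. The struct and field names are illustrative only; they are not the actual QFS definitions.

    #include <stddef.h>

    /* Illustrative types only -- not the actual QFS in-memory or on-disk format. */

    typedef struct qfs_attr {           /* attribute: a name/value pair              */
        const char      *name;          /* e.g. "FileType"                           */
        const char      *value;         /* e.g. "NewsDocument"                       */
        struct qfs_attr *next;          /* singly linked list of attributes          */
    } qfs_attr;

    typedef struct qfs_link {           /* directional edge: parent -> child         */
        unsigned long parent_id;        /* object ID of the link source              */
        unsigned long child_id;         /* object ID of the link target              */
        qfs_attr     *attrs;            /* link attributes, e.g. [LinkType, HasEntity] */
    } qfs_link;

    typedef struct qfs_file {           /* a file object (inode analogue)            */
        unsigned long id;               /* immutable object ID; links reference IDs, */
                                        /* so renaming a file never breaks them      */
        qfs_attr     *attrs;            /* file attributes, POSIX and user-defined   */
        qfs_link    **parents;          /* incoming links                            */
        size_t        nparents;
        qfs_link    **children;         /* outgoing links; a file with children      */
                                        /* plays the role of a directory             */
        size_t        nchildren;
    } qfs_file;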
A third type of operation is with children as “directory”. Links can also respresent presentation which translates query results into strings. any kind of relationship that is not necessarily hierarchi- A Quasar language implementation returns the result as a cal, such as provenance, temporal locality, hyperlinks, directory whose name is the query expression and whose and bibliographic citations. Files may or may not con- contents is a list of names of the final file set. tain any content other than attributes and links. There are two search operations, attribute matching This data model addresses brittle metadata/data as- and neighbor pattern matching. Attribute matching is sociations in Traditional Architectures by storing meta- applied to files and their attributes, while neighbor pat- data in the corresponding file system objects such that tern matching involves parents and/or children of files changes to file path names does not break metadata/data in the file set context. If a Quasar query begins with a associations. It also provides a general schema for stor- search operation, the initial file-set context is the entire ing metadata—attributes and links—rather than requir- file system. ing application developers to design customized rela- tional schemas. The query language used to search 2.2.1 Search: Attribute Matching within this graph data model addresses the other weak- nesses of the Traditional Architecture, i.e., expensive The first search operation, attribute matching, takes one path name evaluation and the global scope problem. or more attributes as parameters. Attribute search re- turns a new file-set context containing those files whose attributes match all attributes specified in the operation 1QFS directional links are not to be confused with hard or symbolic (i.e., the parameter list of attributes is interpreted as links of POSIX file systems. conjunction). An attribute may be flagged as “prohib-

Attribute match operations effectively function as filters over file sets. For example, suppose we wish to find files with attribute/value pairs [FileType, NewsDocument] and [IsTabular, Yes]. Our output file set is the intersection of all files having [FileType, NewsDocument] and all files with [IsTabular, Yes]. The Quasar query expression is

@IsTabular=Yes;FileType=NewsDocument

2.2.2 Search: Neighbor Pattern Matching

The second search operation, neighbor pattern matching, refines an input file set based on neighbor patterns of each file in that set, i.e., based on attributes of parents or children of that file set. A pattern match operator may also specify constraints on links to parents or children based on link attributes. A Quasar expression using a neighbor pattern match looks like

@FileType=NewsDocument@child:SemanticType=Location

where an input set containing files with [FileType, NewsDocument] is filtered to only those whose children match [SemanticType, Location]. Expressions preceded by the @ character are match expressions.

2.2.3 Navigation

In contrast to search operations, which filter file sets, navigation changes the file set through the action of following links from the file set context. The navigation operator accepts link attributes to constrain the links to be followed and file attributes to constrain the result file set. The navigation operation (@navigate) follows links in their "forward" direction, from parent to child. There is also a corresponding operation (@backnav) to traverse from child to parent. For example, the query expression

@FileType=NewsDocument@navigate:^Extractor=Unified

will change the result set from all files with [FileType, NewsDocument] by following links with the attribute [Extractor, Unified]. Match expressions preceded by the ^ character refer to link attributes.

2.2.4 Presentation

The presentation operation translates each query result in the result set into a string. These strings can be attribute values attached to files in the results, including the files' names. For example, the query expression

@FileType=NewsDocument&listby:FileName

lists all the files of [FileType, NewsDocument] by the values corresponding to their FileName attributes.

2.2.5 Examples

The following examples reference the example file system metadata graph shown in Figure 2. A simple query in Quasar only matches file attributes. For example,

@IsTabular=Yes;FileType=NewsDocument&listby:FileName

will return all files that match (@) the listed attribute name=value pairs, and each file is listed by its FileName attribute value.

To illustrate neighbor pattern matching, suppose we have a file system containing some files with attribute/value pair [FileType, NewsDocument] and other files with attribute/value pairs [NodeType, SemanticTag].² Each "NewsDocument" links to the "SemanticTag" files that it contains. Each link is annotated with a "LinkType" attribute with value "HasEntity" ([LinkType, HasEntity]). Our input file set consists of NewsDocument files that are tabular (files with [FileType, NewsDocument], [IsTabular, yes] attribute/value pairs). We refine the file set context by a neighbor pattern match that matches links of type "HasEntity" ([LinkType, HasEntity]) and child files that have [NodeType, SemanticTag] and [SemanticType, Location]. The output file-set context will contain only those NewsDocuments that link to SemanticTags matching the above criteria. In Quasar, the query expression is:

@FileType=NewsDocument/@child:^LinkType=HasEntity;NodeType=SemanticTag;SemanticType=Location

Similarly,

@FileType=NewsDocument@child:SemanticType=Location;SemanticValue=New York&listby:FileName

specifies properties that child nodes must match. First, files of the specified FileType are matched. Second, we narrow down the set of files by matching child nodes with the specified SemanticType and SemanticValue file attributes. Finally, using the presentation operator, we return the set according to the document FileName attribute value.

@FileName=N20090201~N20090301@navigate:^LinkType=HasCoOccurence&listby:ProximityScore

The above query first matches files in the specified range (in this example, files named by a date between February 1st and March 1st, 2009). Second, it traverses links from the matching source files (@navigate), only following links that match the [LinkType, HasCoOccurence] attribute; the ^ character indicates that the link attribute should be matched. Finally, it lists the resulting file set by the ProximityScore attribute.

²This example comes from our workload evaluation application (see Section 4), in which text documents are annotated with semantic entities found within. We represent the semantic entities as directories linked from the file containing the text, and label such directories as "nodes", hence the use of the "NodeType" attribute.
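Because Quasar expressions take the place of path names, the examples above can be issued through unmodified POSIX calls. The sketch below assumes a QFS volume mounted at /mnt/qfs (a hypothetical mount point) and lists the result set of the first example query using opendir/readdir.

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical mount point; the query string follows the examples above. */
        const char *query =
            "/mnt/qfs/@IsTabular=Yes;FileType=NewsDocument&listby:FileName";

        /* QFS returns the query result as a virtual directory whose entries
         * are the FileName attribute values of the matching files.           */
        DIR *d = opendir(query);
        if (d == NULL) {
            perror("opendir");
            return 1;
        }

        struct dirent *e;
        while ((e = readdir(d)) != NULL)
            printf("%s\n", e->d_name);

        closedir(d);
        return 0;
    }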

Figure 2: An example of files, links, and attributes. Circles represent files, arrows represent links.

Figure 3: The QFS prototype software architecture is a single-host file server exporting a FUSE interface that allows clients of the POSIX file system interface to pass Quasar expressions.

3 Implementation

We have implemented a prototype metadata-rich, queryable file system called the Quasar File System (QFS). This prototype allows us to explore the fundamental costs, benefits, and challenges that are incurred by the graph data model approach and by searchable metadata within the file system.

3.1 Overview

As shown in Figure 3, QFS is implemented as a file server running in user space using the FUSE interface [41]. Clients pose standard POSIX file system operations to the Kernel Interface via system calls. The Virtual File System forwards the requests to the FUSE Kernel Module, as is standard for mountable file systems. The FUSE client kernel module serializes the calls and passes the messages to the QFS software running the file server code in user space.

The FUSE Library implements the listening part of the service, which receives the messages from the kernel and decodes the specific file system operations. The QFS File System Interface implements handler routines for the various file system operations and interacts with the other components of the system.

To obtain a file ID, the client submits a Quasar expression, which is parsed by the Query Parser and then passed to the Query Processor. The processor generates a query plan and then looks up query terms in the Metadata Store / Index Manager. The MS/IM returns posting lists of relevant file or link IDs, or may filter attributes for a particular file. The query processor uses standard query planning strategies using statistics on the stored metadata. The store manager uses the underlying file system to store metadata structures. Once the query processor has computed an answer to the query, it returns the list of IDs to the file system interface.

Other file system operations may go directly from the interface operation handler to the data or metadata management components. Stat and attribute update/retrieval calls go directly to the store manager, once the specified file has been looked up. File data operations (read/write) go to a File Data Placement manager. In our QFS prototype, this module maps file data to files stored within an underlying local (ext2) file system. Only non-zero length files³ are represented in the ext2 file system. Zero-byte files that contain only attributes and links are managed solely by the metadata store and are therefore significantly cheaper to manage than regular files. For POSIX compliance, a zero-byte file with links is equivalent to a directory.

3.2 QFS semantics for directory/file operations

QFS follows POSIX semantics as closely as possible, and extends the semantics as needed for operations that involve metadata and links. In particular, as many file system operations require a pathname to a particular file, operations posed to QFS may specify a "pathname query", which accepts any valid Quasar expression, including POSIX paths.

³or zero-byte files without links

A high level description of QFS behavior for common file system calls is as follows:

stat: Looks up the pathname query. If stat matches a single file, it returns the POSIX attributes for that file from the metadata store. If more than one file matches, stat returns attributes for a virtual directory.

open (create, write): Looks up the pathname query. If there is no match, open creates a new file object in the metadata store, stores the name and attributes given in the query expression, and looks up a parent file. If a parent is found, it creates a link with the parent as source and the new file as link target, creates a file in the underlying file system for data storage, and opens that file. If the initial query matches a file, it opens the corresponding underlying file and truncates it. Finally, it returns the handle to the opened file.

open (read): Looks up the pathname query. If exactly one result is found and it is not flagged as a directory, it opens the corresponding data file in the underlying file system. Otherwise, it follows the opendir semantics.

mkdir: Same as "open create", but sets the "DIR" flag in the file object and does not create or open an underlying file, as no data storage is necessary.

opendir: Looks up the pathname query. For each query result, opendir looks up particular attributes to return for the result based on a "ListBy" operator in the query. Opendir returns a directory handle to the client. It stores the attribute value strings in a cache for successive readdir operations until the directory handle is closed.

readdir: Retrieves the next directory entry (or query result) in the result cache.

read/write: For a given open file handle, performs a read/write operation on the corresponding file in the underlying file system for data storage.

close(dir): Passes file handles to the underlying file system to close the file. Frees temporary structures used to represent query results for directory listings.

chmod/chown/utime: Looks up the pathname query. Then, modifies the permissions, owner, or time attribute for the result file's object structure.

rename: Depending on the result of the pathname query, rename will do one of the following: (1) change the name (or other) attribute for a file, without affecting its parents/children, (2) change the parent of a file, or (3) update the affected link(s) and their associated attributes. The pathname must resolve to a single source file.

unlink: Looks up the pathname query. If the query matches a single file, it also looks up the parent to the file within the query, determines the link between parent and child, and removes that link from the metadata store, including all of its link attributes.

A consequence of changing attributes of a file is that it might invalidate the path name that an application uses to refer to that file. For example, if an application names a file by the attribute k = v and then subsequently changes its attribute to k = v′, the original name does not resolve to that file anymore. One way to provide greater name space stability is to (1) use QFS-assigned immutable file or link IDs to address files (equivalent to inode numbers), as both are searchable attributes in QFS, or (2) make a unique, immutable object ID for each file and link available as attributes and include object IDs in the Quasar name space (if applications need the convenience of their own ID schemes). Either scheme provides applications with names that are immune to any metadata changes. The second approach is already used in existing systems; for instance, document databases use DOIs.

3.3 Standard Optimizations

Similarly to regular file systems, the Query Processor maintains a name cache which maps Quasar path expressions to an already computed query plan and result set of file and link IDs. We have found that even a single-element name cache has significant performance benefit since it can handle consecutive query operations with the same pathname.
This is a frequently recurring pattern, as applications often stat a pathname query prior to issuing an open or opendir.

Another commonly used optimization, batching commands from client to server, has also been implemented within the file open/read/write operation handler routines. The client can open a distinguished "synthetic" file and batch multiple newline-delimited Quasar metadata update expressions into a single write operation. Update expressions include a code word instructing QFS whether to create a new file or link, or to add attributes to an existing file/link. Additionally, reading a group of directory entries can be accomplished by a single client command. Instead of issuing a single readdir call for every single directory entry, the client can issue a batch-readdir and receive a handle to a synthetic file that includes all directory entries.
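A sketch of the client side of the batching optimization described above: several newline-delimited Quasar update expressions are written to the distinguished synthetic file in a single write call. The synthetic file name and the leading code words are placeholders; the text only states that such code words exist, not how they are spelled.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder name for the distinguished synthetic batch file. */
        int fd = open("/mnt/qfs/.qfs_batch", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* One write() carries several newline-delimited update expressions.
         * The leading code words (FILE/LINK) are illustrative only.          */
        const char batch[] =
            "FILE @FileName=N20090415;FileType=NewsDocument\n"
            "FILE @SemanticType=Location;SemanticValue=New York\n"
            "LINK @FileName=N20090415 ^LinkType=HasEntity @SemanticValue=New York\n";

        if (write(fd, batch, strlen(batch)) < 0)
            perror("write");

        close(fd);
        return 0;
    }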

Figure 4: The schema of the QFS metadata store is optimized for attribute matching, neighbor pattern matching, and navigation. It comprises a Superblock, File and Link Tables with their attributes, Parent/Child ID Lists, File and Link Attribute Indices (red-black trees) with posting lists of file and link IDs, an Attribute Names table, and shared Value Strings Storage.

3.4 Metadata Store / Index Manager (MS/IM)

The MS/IM assumes a graph data model for the metadata-rich file system and the basic Quasar operations as introduced in Section 2. The data structures of the metadata store are a collection of arrays, sorted lists, and red-black trees. These data structures are backed by a memory-mapped file in the underlying file system. The data structures are optimized for query operations expressible in Quasar, namely attribute matching for a given set of files, neighbor pattern matching, and navigation (see Figure 4). The metadata store has a Superblock, which contains references to the other structures within the store and some global statistics, such as file and link counts. The File Table is an array that maps file IDs to file objects (similar to inodes), each of which includes lists of File Attributes and pointers to a list of parents and a list of children (recall that "parents" are files with links pointing to the current file and "children" are files to which the current file's links point). Within the list (Parent/Child ID Lists) entries, each parent and each child is represented as a tuple containing a file ID and a link ID. The link source and target need not be stored explicitly as they can be accessed through the File Table. For instance, neighbor pattern matching and link traversal query operations start with a given set of files and never with a set of links, so the links' sources and targets are already known. The Link Table is an array that maps link IDs to each Link Attribute list. The File and Link Attribute Indices are red-black trees, and they map attributes (name-value pairs as keys) to Lists of File and Link IDs (Postings).

File/link attribute names within the file/link tables and indices contain string references to entries in the Attribute Name Table, a hash table. String attribute values refer to a common Value Strings Storage shared with the indices. When any file or link attribute is added, the MS/IM determines if the attribute name exists in the names table, via a standard hash function, and, for string values, if the value is present, via an index lookup. Numeric values (integer and floating point) are stored directly in the file attributes.

To further illustrate how these structures are used, consider the match operator. Single Quasar match operators find the search attribute name and value in the file attribute index tree. Once the attribute node is located, the list of matching file IDs is returned. In the case of match operators with multiple attributes, the query processor determines whether (1) multiple lists should be intersected (computation time O(n1 + n2), where n1 and n2 are the lengths of the lists), or (2) the initial list of file IDs should be filtered by looking up attributes via the file table (constant time lookup for each attribute, thus O(n1)).

The design and careful implementation of metadata management is key to the QFS prototype. Unlike schemata for relational databases, which are tailored to each application, QFS maintains a single metadata store schema for all applications.

3.5 Query Planning and Processing

For each individual operation, the query planner orders the attached search attribute terms from smallest to largest. This process enables the query processor to select the more efficient of the two strategies described in the example above (end of Section 3.4) when multiple terms are encountered in a query match operation. In our current prototype implementation, both strategies and the query processor's ability to select between them are implemented only for attribute match query operations. Filtering by attribute terms on navigation and neighbor match operators only uses method (2). Future work for the query planner and processor includes implementing query planning between the above methods (1) and (2) for all query operators (e.g., navigation and neighbor match). Additionally, more complex future work includes reordering of operators prior to query processing in order to boost query performance.
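The choice between the two match strategies of Section 3.4 can be sketched as follows: strategy (1) merges two sorted posting lists in O(n1 + n2), while strategy (2) filters the shorter list through constant-time file-table lookups in O(n1). The function names, the has_attr() stub, and the cost threshold are ours, not QFS internals.

    #include <stdbool.h>
    #include <stddef.h>

    /* Stub for the constant-time file-table attribute lookup; the real MS/IM
     * would consult the File Table entry for file_id.                         */
    static bool has_attr(unsigned long file_id, const char *name, const char *value)
    {
        (void)file_id; (void)name; (void)value;
        return true;
    }

    /* Strategy (1): intersect two sorted posting lists, O(n1 + n2). */
    static size_t intersect(const unsigned long *a, size_t n1,
                            const unsigned long *b, size_t n2, unsigned long *out)
    {
        size_t i = 0, j = 0, k = 0;
        while (i < n1 && j < n2) {
            if (a[i] < b[j])        i++;
            else if (a[i] > b[j])   j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return k;
    }

    /* Strategy (2): filter the first posting list via the file table, O(n1). */
    static size_t filter(const unsigned long *a, size_t n1,
                         const char *name, const char *value, unsigned long *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n1; i++)
            if (has_attr(a[i], name, value))
                out[k++] = a[i];
        return k;
    }

    /* Planner sketch: start from the shortest posting list (a), then either
     * intersect with the next term's posting list or filter through the file
     * table, whichever looks cheaper. The threshold is illustrative only.     */
    size_t match_two_terms(const unsigned long *a, size_t n1,
                           const unsigned long *b, size_t n2,
                           const char *name2, const char *value2,
                           unsigned long *out)
    {
        if (n2 < n1)    /* O(n1 + n2) merge beats per-element lookups here */
            return intersect(a, n1, b, n2, out);
        return filter(a, n1, name2, value2, out);
    }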

3.6 Managing the Metadata Store with Berkeley DB

We have implemented a metadata store manager using Berkeley DB [33] as an alternative to the "custom" manager described above. Our custom QFS metadata store manager uses an in-memory index backed by an mmapped file. In contrast, Berkeley DB's key-value interface and underlying data structures have been engineered for superior performance on disk. The purpose of constructing a Berkeley DB implementation was to evaluate the benefit of using a highly optimized library with support for out-of-core data structures as the underlying QFS access mechanism.

Our implementation using Berkeley DB roughly follows the metadata store design shown in Section 3.4 and Figure 4. We created Berkeley DB hash tables to store the index counts (files/links per attribute) and the file/link attribute collections. Counts of parents/children per file are stored with the file attributes. B-trees are used to create the attribute indices and maintain the mappings of files to their parents and children.

Following the evaluation methodology of Section 4.1.1, we measured the performance of the Berkeley DB implementation on document ingest (inserting files, links, and attributes into the file system). For smaller workloads, the Berkeley DB implementation was about 80% slower than the custom manager. With a workload of 100,000 documents, the experiment did not complete after 36 hours of run time. We attribute this performance to the need to separate the metadata store into multiple component databases. Each database needs its own separate DRAM cache, and thus available DRAM (16GB in our test environment) was split among eight caches. In contrast, the QFS custom metadata store uses the single file system buffer cache, so no such split is necessary. Thus we focused efforts on optimizing the custom metadata store manager rather than using the Berkeley DB library.

Figure 5: Comparison of Ingest Performance: QFS vs. File System + Relational DB (lower is better). Elapsed time in seconds is plotted against corpus size (4000, 20000, 100000, and 200000 documents).

4 Evaluation

In this section we present a quantitative evaluation of QFS. We evaluate QFS using metadata-intensive ingest and query workloads. The workloads are generated by a data management application that fits the Traditional Architecture. Second, we explore the costs and potential benefits of the use of metadata-rich file systems in comparison with a standard POSIX file system for normal file system operations.

4.1 User-defined metadata: ingest and query

To evaluate the performance of QFS on ingest and query workloads we have extended the entity extraction benchmark Lextrac [13]. The benchmark stresses the novel aspects of QFS by the extensive use of links to express relationships among objects, and by the storage and retrieval of searchable metadata attached to links and files.

In its original form, Lextrac processes files in a multi-stage analysis pipeline. Each stage appends newly derived metadata to the analyzed file, such as entities and proximity scores between entities, so it is available to the next stage along with the original content. At the final stage of processing, the appended parts of each file containing the metadata are parsed and stored in a relational database, along with references to the corresponding files. Thus, the arrangement of the original Lextrac is a system of the Traditional Architecture.

In contrast to applications fitting the Traditional Architecture used for survey astronomy (e.g., SDSS, PAN-STARRS, LSST) [9, 11, 38, 40], high-energy physics [10] and other data-intensive application domains, we believe our test application is more metadata intensive per the quantity of original file data (see Tables 1 and 2). For instance, the most recent SDSS data release [39] contains 357 million objects within 15.7 TB of image file data. When the Reuters News corpus sample data set is fully ingested, Lextrac has created roughly 540,000 entities for 4000 news document files and 26 million entities for 200,000 news document files using 928 MB of storage.

We extended Lextrac so it supports the QFS storage interface in addition to the POSIX I/O interface and can take full advantage of the QFS data model. We also added to Lextrac's ingest phase a query phase which consists of a set of queries that are representative of applications that make use of entity extraction. These queries contain selection and projection criteria for documents, entities, and proximity scores, and can be posed either to a relational database as part of a Traditional Architecture or to QFS.

Our evaluation was conducted on a Dual-Core AMD Opteron 2.8 GHz, 4-socket server with 16GB main memory running Linux kernel version 2.6.18 with a 250GB SATA drive. For the FS+DB/SQL configurations discussed in this section, we have configured PostgreSQL with a schema specific to the Lextrac application. We create indexes on all columns within this schema to provide suitable SQL query performance. (Experiments with PostgreSQL without indices have resulted in performance so uncompetitive that a comparison would best be characterized as a "straw-man".) In addition, we run the database without transactions or isolation.

4.1.1 Ingest

We compare ingest performance of several workload sizes of the Traditional Architecture vs. QFS, with the sizes ranging from 4,000 to 200,000 documents from the Reuters News Corpus. Exactly the same entity extraction computation is performed in either case, the difference being how the entities, proximity scores, and other metadata are stored. As shown in Figure 5, QFS completes ingest in slightly more than half the time of the traditional File System plus Database (FS+DB) approach for the smaller workload sizes (4,000-20,000 documents), and less than half the time for the larger workloads (100,000-200,000 documents). We measure the speedup of QFS over FS+DB at 2.44 times faster for the ingest workload at the largest corpus size.

4.1.2 Querying

The query study uses query templates Q0 – Q4 representative of queries that would be applied to the document set. Below we describe these query templates with examples that follow Figure 2. Our goal in selecting these query scenarios is to stress both searches for files and semantic querying capabilities of the metadata managed by QFS: of the query templates presented here, Q0 – Q1 return files and Q2 – Q4 return metadata items.

Q0: Find all documents that are linked to a particular entity. Example: Find all documents linked to a place "New York."

Q1: Find all documents that link to both entities X and Y that have a particular proximity score between them. Example: Find all documents that link to a place "New York" and organization "NYSE" with co-occurrence of proximity score "25".

Q2: Find all entities related to entity X in documents with names in a particular range and whose proximity to entity X has a score of Y. Example: find entities co-occurring with "New York" in documents with names in the range "N20090101" – "N20090331" whose proximity score with "New York" is "25".

Q3: Find all proximity scores within a particular range relating two particular entities in documents with names in a particular range. Example: find the proximity scores in the range of "20" – "30" relating "New York" and "NYSE" in documents with names in the range of "N20090101" – "N20090331."

Q4: Find ALL proximity scores (no range constraint, unlike Q3) relating two particular entities in documents with names in a particular range. Example: find the proximity scores relating "New York" and "NYSE" in documents with names in the range of "N20090101" – "N20090331."
For the query workload experiment Q0, query terms (the entity values) were selected from subsets of the terms appearing in the data. The entire collection of terms was sorted by frequency of occurrence in the document set, and then the subset was created by selecting terms from the sorted list according to either an arithmetic or geometric series. The arithmetic series favors terms with low document frequencies (as are a majority of entities), while the geometric series samples from most frequent to least. Our preliminary work with queries indicated that the more frequent terms resulted in longer query times than the infrequent terms, due to processing of long lists of results. Thus, we developed this process to provide a meaningful variety of terms to use in the queries, as simply selecting the query term at random would favor the majority of entity-value terms, which correspond to relatively few documents. Specifically, our query test suite for Q0 selects entity values based on combining the arithmetic and geometric series, split evenly between each.

For Q1 – Q4, we follow a different procedure to generate their query terms, which might be one or two entity values, a proximity score, a range of scores or a range of documents. For each query, we randomly choose a document co-occurrence, that is, a 4-tuple consisting of a document, a pair of entities, and the proximity score for the pair. This procedure guarantees us at least one valid result. We refer to queries with 1+ results as category (a). However, we also want to examine the cases where queries return no results. We consider two sub-cases of no results: the query terms are all valid but the query produces a null set after intersection from a SQL join or QFS operation (category (b)), or one of the query terms does not match a valid value in the index (category (c)). For those categories (b and c) we have altered the query terms from (a) for each sub-case (either null-set or no match). In all cases and query classes, the queries are run repeatedly to guarantee that the buffer cache of QFS's metadata store and Postgres's page cache are warm with any data necessary for the particular query.
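A sketch of the Q0 term-selection procedure described above, assuming the entity terms have already been sorted by ascending document frequency; the stride and ratio constants are arbitrary choices for illustration.

    #include <stdio.h>

    #define NTERMS     16
    #define ARITH_STEP 4      /* arbitrary arithmetic stride */
    #define GEO_RATIO  2      /* arbitrary geometric ratio   */

    int main(void)
    {
        /* Placeholder terms, assumed sorted by ascending document frequency. */
        const char *terms[NTERMS] = {
            "t00", "t01", "t02", "t03", "t04", "t05", "t06", "t07",
            "t08", "t09", "t10", "t11", "t12", "t13", "t14", "t15"
        };

        /* Arithmetic series: fixed stride through the low-frequency tail. */
        for (int i = 0; i < NTERMS; i += ARITH_STEP)
            printf("arithmetic pick: %s\n", terms[i]);

        /* Geometric series: from the most frequent term downward,
         * dividing the index by a constant ratio each step.          */
        for (int i = NTERMS - 1; i > 0; i /= GEO_RATIO)
            printf("geometric pick: %s\n", terms[i]);

        return 0;
    }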

Document Count                      4000      20000     100000    200000
Metadata storage size               560 MB    1.8 GB    8.5 GB    17 GB
Total Files + Directories (1000s)   541       2720      13,200    26,500
Links (1,000,000s)                  1.82      9.23      46.2      92.3

Table 1: QFS metadata storage characteristics

Figure 6: Comparison of Query Performance (means with std. deviation): QFS vs. Relational DB, 200,000 document corpus (lower is better). Query response time is reported in seconds. (a) Queries Q0 – Q4: 1+ results, with queries for files on the left and semantic queries for values on the right; (b) Queries Q1 – Q4: 0 results (null-set join); (c) Queries Q1 – Q4: 0 results (no match).

Figure 6 shows mean response times listed by result category and query class. In subfigure (a), all run queries return 1+ results. For the file queries (a, left), Q0 is 3 times slower in Postgres vs. QFS. Postgres performs two join operations for Q0, whereas QFS performs an attribute and a neighbor match operation. Q1 is 20 times slower in Postgres than QFS, and this particular query is significant because it is a complex file system query where the relational database must perform 5 joins. In contrast, QFS requires an attribute match, two navigational operations, and one neighbor match operation, and finally lists the query results by the filename.

The semantic queries Q2 – Q4 (a, right) have slightly (approx. 30%) slower response times for QFS than Postgres. Like Q1, Q2 – Q4 have five join operations in the Postgres queries. Q3 and Q4 take more time than Q2 in QFS. We attribute the difference to the Q3 – Q4 templates, which contain a slower neighbor match in place of a navigation query operation.

We measure similar response times to Q2 – Q4 (a) in subfigure (b) over all query classes, where there are no results for each query due to null-set joins/intersections. Here, we think that the Postgres query planner has optimizations that halt the query without results. Additionally, subfigure (c) shows that QFS answers queries with no results (no-match category (c)) 10-40 times faster than Postgres for all four classes of query. We attribute this speedup to the QFS query planner, which quickly determines if there are no results by checking all the query terms against the index.

The error bars in the figures show that both systems have some degree of variance in their response times. Results for Q0 in both systems can be directly attributed to the number of documents that match the requested entity in each query, subsequently corresponding to the number of results returned. For Q1, there are two entities involved in the query, so the sum of the number of documents corresponding to each entity is correlated with an upper bound on the response time, while there is a constant lower bound. For Q2, as with Q0, there is a single entity per query. The number of documents corresponding to it strongly correlates with the QFS response time. However, most SQL response times are close to the constant lower bound, with a group of outliers. The Q3 and Q4 patterns are similar to Q1 for QFS and Q2 for SQL; however, there are many more outliers for SQL in Q3 and Q4, hence the larger standard deviation.

Figure 7 shows mean response times for QFS and Postgres with three corpus sizes. In (a) the results from Q1 show that the speedup of QFS response time vs. Postgres increases with the document count. For Q2, Postgres is slower on average with 40,000 documents, but becomes faster when the corpus size is increased to 100,000 and 200,000 (note that (a) uses log scale while (b) does not). In (b) the response times for QFS scale linearly (comparable to Q1 – Q2 in (a)). Q4 Postgres response time appears to scale linearly but at a lesser rate than QFS.

Original Documents    928 MB
Intermediate Files    3.4 GB
Database Storage      7.7 GB

Table 2: Storage characteristics for FS+DB, 200,000 documents

Figure 7: Comparison of Query Performance: QFS vs. Relational DB, varying corpus sizes (lower is better). (a) Queries Q1 and Q2; (b) Queries Q3 and Q4. Mean query response time in seconds is plotted against corpus size (in thousands of documents).

However, the increase in response time for Q3 as the document count increases is much closer to 0. We attribute the difference to the additional WHERE clause range filters present in those queries, which restrict the size of the sets earlier in the query processing, resulting in faster joins.

Response times increase for QFS as the number of documents increases because on average the parent/child list lengths per node increase. Our navigation operators load entire lists of IDs from the buffer cache (memory-mapped) into temporary arrays used in the query, and so with a larger corpus size there is a longer average parent/child list length, each requiring additional time. Nonetheless, we observe an increase in response time for Postgres with the doubling of the corpus size.

4.1.3 Discussion

The performance results show that QFS ingest performance out-performs the Traditional Architecture over a variety of data set sizes. QFS query performance varies depending on the query class, whether all terms match those in the index, and whether results are returned. Most notably, the query classes that are geared to return files perform 3 times better in the simple class (Q0) and, in our complex class (Q1), 20 times better than PostgreSQL. Although Postgres performs slightly better than QFS in a number of our test cases, the relational database has been optimized over the course of 30 years of data management research. On the other hand, we have spent hardly that much time on optimizations to our highly-specialized file system query engine. Our future work will be to address the deficiencies of the current query planner/processor, focusing on reordering operations in order to reduce computation times.

Table 1 shows the size of the metadata store used by QFS for different document collection sizes. The metadata store requirement of the Traditional Architecture implementation is much smaller (Table 2). For the largest document set, QFS uses nearly 2 times the amount of space for metadata as does the FS+DB approach. A contributing factor to this storage size difference is that we have optimized QFS for ingest in-memory and construct posting lists as single-value linked lists. While these lists have low overhead for insertion, they have the memory overhead of a pointer field per node. Given that we conducted our experiments on a 64-bit system, the result is 8 bytes per node. Our future work will include examining alternative list structures and techniques to deal with the issue of 64-bit pointers, such as relative addressing in segments.

4.2 Standard file operations

We present several measurements to compare the performance of QFS with a conventional Linux file system, ext2, under regular file system workloads and micro-benchmarks that only use POSIX I/O operations. Since the difference between QFS and ext2 is primarily in the management of metadata, we are particularly interested in POSIX I/O metadata-oriented operations. Therefore the several benchmarks we devised exercise mkdir, stat, opendir/readdir, and rename. For this comparison, we compare QFS running under FUSE with ext2 also running under FUSE.

Figure 8 shows three measured categories. For MKDIR (first bars), the system creates a directory 10-ary tree with 111,110 total directories. FIND (second bars) measures running the find command over the directory mentioned above, and that measurement exercises stat, opendir and readdir operations. MOVE (third bars) measures 5115 individual directory moves.

Figure 8: Measurements of QFS vs. ext2, benchmarking several file system operations (MKDIR, FIND, RENAME; time in seconds). Both file systems are accessed via a FUSE interface.

In the first (MKDIR) category, we observe that QFS-FUSE completes nearly 4 times faster than EXT2-FUSE. With EXT2-FUSE, the mkdir operation is performed by the ext2 file system, whereas in QFS, the operation is performed by the metadata store and index manager. Additionally, our prototype implementation of QFS has been optimized towards writing large numbers of files, as is required to ingest metadata-rich workloads.

The second category (FIND) shows QFS-FUSE completing the procedure in 23% less time than EXT2-FUSE. Due to the nature of the opendir/readdir interface, the FUSE interface likely accounts for much of the measured time in both implementations. The find utility performs a large number of lookup operations. Ext2 must search through directory entries linearly in order to resolve pathnames, while QFS uses an index.

We observe a factor of 2.8 slower performance for QFS-FUSE vs. EXT2-FUSE for the MOVE benchmark. The QFS prototype has not been optimized for quickly moving files around, and so the individual performance of these operations may suffer to the benefit of other operations. However, we consider this an acceptable trade-off, as our example domains rarely demand that many files move intra-device as performed in this example. More often we may see many files move inter-device, where both data and metadata must be transferred between file systems.

5 Related Work

Our work with metadata-rich file systems has drawn from (1) work in data management and (2) file systems that have tried to integrate sophisticated data management constructs into their metadata management.

Examples of searchable file systems using relational databases and keyword search engines include Apple's Spotlight [7], Beagle for Linux [8], and Windows FS Indexing [29]. These systems provide full-text search and metadata-based search and have indexing subsystems that are separate from the file systems and require notification mechanisms to trigger incremental indexing. A recent experimental file search system, Spyglass [27], provides attribute indexing using K-D trees. The authors also compare the performance of Spyglass with a relational database and find that Spyglass has superior query performance when executing joins over multiple attributes.

There has been a good amount of research focused on enhancing file systems through attribute-based paths. The Semantic File System [20] and the Logic File System [34] provided interfaces that use boolean algebra expressions for defining multiple views of files. Other approaches have a separate systems interface to handle searching and views of files, namely the Property List DIRectory system [28], Nebula [12], and attrFS [43]. Some systems combine POSIX paths with attributes [37, 31] and directories with content [21]. Most recently, Perspective provided a decentralized home-network system that uses semantic attribute-based naming for both data access and management of files [36]. While many of these systems maintained the equivalent of extended attributes on files before these became part of POSIX, none provide relational linking between files.

PASS [30] proposed how provenance data could be managed behind a kernel-based interface and tied into the file system. Their model includes relationships among files, but they do not keep name-value pair attributes on links. They restrict the scope of the relationships to the provenance domains. We propose that metadata-rich file systems should manage any conceivable relationship between pairs of files.

Like QFS, our previous work, the Linking File System [2, 3], included links with attributes between pairs of files. However, LiFS does not implement a query language or any indexing. A key assumption for the design of LiFS's metadata management is the availability of non-volatile, byte-addressable memory with access characteristics similar to DRAM. The design of QFS does not make that assumption.

Many researchers have proposed graph query languages [6]. Of those that have been accompanied by backing implementations, relational databases were used to implement the languages. Experience with a relational database and graph language has not yielded suitable performance [26]. In addition, we are not aware of a graph language that resembles XPath and thus could subsume POSIX paths.

12 sume POSIX paths. with attributes was demonstrated. We quantitatively an- Holland et al propose a query language for Provenance alyze the QFS implementation with respect to insertion [24]. They choose the Lorel language [1] from the Lore and query of metadata and links, and compared perfor- system as a basis for provenance and reject the choice of mance to the de facto standard method of metadata man- XPath. The Lore data model (OEM) differs from ours. agement, the Traditional Architecture using a file system Our experience with Quasar justifies that an XPath-based plus relational database. We show that QFS can scale language for graph queries is reasonable, given the addi- to millions of objects in hundreds of thousands of files. tional features of Quasar that supplant those of XPath, to Using a simple, general graph model schema, QFS out- fit our data model. performs the traditional architecture by a factor of 2 in CouchDB is a distributed, document-centric database ingest and is comparable to the traditional architecture in system which provides contiguous indexing and supports query. We identify directions for performance improve- views [16]. Dataspace systems are an approach to index ment in the QFS implementation to improve and manage semi-structured data from heterogenous data and response time. sources [23]. It is not clear whether their approach as- sumes a file system, and if so, how it would interact with their structured data interface. Acknowledgements The transactional record store of [22] attempts to pro- vide a structured storage alternative to databases that ad- This work is supported in part by the Department of En- dressed the duality between file systems and databases. ergy under Contract DE-AC52-07NA27344, award DE- This approach is a record-based file system, with much FC02-06ER25768, and industry sponsors of the UCSC of the focus on the transactions used to interact with Systems Research Lab. We thank John May and Scott these records, but the system prototype uses a relational Brandt for their valuable feedback on this paper, Ethan database back-end. L. Miller for his advice during the early stages of the MapReduce [14] is a framework for distributed data work, and John Compton of LLNL for his guidance in processing that shows an example of the use of file sys- the design of the query templates. tem storage for data management problems instead of re- lational databases. Though Google designed the frame- References work for its need to build and query large distributed indices, their success has pushed the model into other [1]A BITEBOUL,S.,QUASS,D.,MCHUGH,J.,WIDOM,J., AND uses in data processing and management. Hadoop has WIENER, J. The lorel query language for semistructured data. International Journal on Digital Libraries 1 (1997), 68–88. become a popular open-source alternative for applica- tions that wish to employ the MapReduce processing [2]A MES,A.,BOBB,N.,BRANDT,S.A.,HIATT,A., MALTZAHN,C.,MILLER,E.L.,NEEMAN,A., AND TUTEJA, model in Java, and there are a number of technologies D. Richer file system metadata using links and attributes. In in other languages. The Pig and [25] projects pro- Proceedings of the 22nd IEEE / 13th NASA Goddard Conference vide alternatives, where they employ specialized high- on Mass Storage Systems and Technologies (Monterey, CA, Apr. level languages that have both imperative and declarative 2005). features, including syntax for handling joins. 
In addition, Dryad is a more generalized parallel data-processing model.

A common criticism of the MapReduce approach is that it requires imperative-style programming [15], as opposed to posing queries in declarative languages such as SQL. For instance, to join over multiple data sets in Hadoop, the application programmer must implement the join functionality instead of relying on a query planner.
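As a concrete illustration of this criticism, the sketch below simulates a hand-written, reduce-side equi-join in plain Python; the map_phase and reduce_phase functions and the users/orders records are invented for the example and do not use Hadoop's actual API, whereas a relational engine would accept the equivalent join as a single declarative statement.

# Illustrative sketch only: a hand-written, reduce-side equi-join expressed in
# MapReduce style (simulated in plain Python, not Hadoop's actual API).
# In SQL the same result is one declarative statement, e.g.:
#   SELECT u.name, o.item FROM users u JOIN orders o ON u.id = o.user_id;
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]                          # (id, name)
orders = [(1, "telescope"), (1, "tripod"), (2, "filter")]   # (user_id, item)

def map_phase():
    # Tag each record with its source so the reducer can pair them up.
    for uid, name in users:
        yield uid, ("user", name)
    for uid, item in orders:
        yield uid, ("order", item)

def reduce_phase(grouped):
    # The programmer, not a query planner, decides how to combine the sources.
    for uid, records in grouped.items():
        names = [v for tag, v in records if tag == "user"]
        items = [v for tag, v in records if tag == "order"]
        for name in names:
            for item in items:
                yield name, item

grouped = defaultdict(list)
for key, value in map_phase():      # shuffle: group values by join key
    grouped[key].append(value)

print(list(reduce_phase(grouped)))  # [('alice', 'telescope'), ('alice', 'tripod'), ('bob', 'filter')]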

To improve understanding of the trade-offs among these technologies, Pavlo et al. compare and contrast Hadoop, a traditional RDBMS, and Vertica, a column-store database [35].

6 Conclusion

We have presented the design and prototype implementation of a metadata-rich file system and its associated query language. The utility of providing relational links with attributes was demonstrated. We quantitatively analyzed the QFS implementation with respect to insertion and query of metadata and links, and compared its performance to the de facto standard method of metadata management: the Traditional Architecture using a file system plus a relational database. We show that QFS scales to millions of objects in hundreds of thousands of files. Using a simple, general graph model schema, QFS outperforms the Traditional Architecture by a factor of two in ingest and delivers comparable query performance. We identify directions for improving ingest and query response time in the QFS implementation.

Acknowledgements

This work is supported in part by the Department of Energy under Contract DE-AC52-07NA27344, award DE-FC02-06ER25768, and industry sponsors of the UCSC Systems Research Lab. We thank John May and Scott Brandt for their valuable feedback on this paper, Ethan L. Miller for his advice during the early stages of the work, and John Compton of LLNL for his guidance in the design of the query templates.

References

[1] Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. The Lorel query language for semistructured data. International Journal on Digital Libraries 1 (1997), 68–88.
[2] Ames, A., Bobb, N., Brandt, S. A., Hiatt, A., Maltzahn, C., Miller, E. L., Neeman, A., and Tuteja, D. Richer file system metadata using links and attributes. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (Monterey, CA, Apr. 2005).
[3] Ames, S., Bobb, N., Greenan, K. M., Hofmann, O. S., Storer, M. W., Maltzahn, C., Miller, E. L., and Brandt, S. A. LiFS: An attribute-rich file system for storage class memories. In Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies (College Park, MD, May 2006), IEEE.
[4] Ames, S., Maltzahn, C., and Miller, E. L. Quasar: A scalable naming language for very large file collections. Tech. Rep. UCSC-SSRC-08-04, University of California, Santa Cruz, Oct. 2008.
[5] Ames, S., Maltzahn, C., and Miller, E. L. Quasar: Interaction with file systems using a query and naming language. Tech. Rep. UCSC-SSRC-08-03, University of California, Santa Cruz, Sept. 2008.
[6] Angles, R., and Gutierrez, C. Survey of graph database models. ACM Comput. Surv. 40, 1 (2008), 1–39.
[7] Apple Developer Connection. Working with Spotlight. http://developer.apple.com/macosx/tiger/spotlight.html, 2004.
[8] Beagle Project. About Beagle. http://beagle-project.org/About, 2007.

[9] Becla, J., Hanushevsky, A., Nikolaev, S., Abdulla, G., Szalay, A., Nieto-Santisteban, M., Thakar, A., and Gray, J. Designing a multi-petabyte database for LSST, Apr. 2006.
[10] Becla, J., and Wang, D. L. Lessons learned from managing a petabyte. In CIDR (2005), pp. 70–83.
[11] Bell, G., Hey, T., and Szalay, A. Beyond the data deluge. Science 323, 5919 (March 2009), 1297–1298.
[12] Bowman, C. M., Dharap, C., Baruah, M., Camargo, B., and Potti, S. A file system for information management. In Proceedings of the ISMM International Conference on Intelligent Information Management Systems (March 1994). Nebula FS.
[13] Cohen, J., Dossa, D., Gokhale, M., Hysom, D., May, J., Pearce, R., and Yoo, A. Storage-intensive supercomputing benchmark study. Tech. Rep. UCRL-TR-236179, Lawrence Livermore National Laboratory, Nov. 2007.
[14] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI) (San Francisco, CA, Dec. 2004).
[15] DeWitt, D., and Stonebraker, M. MapReduce: A major step backwards. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html, January 2008.
[16] The Apache Software Foundation. Apache CouchDB: Technical overview. http://incubator.apache.org/couchdb/docs/overview.html, 2008.
[17] Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97) (Oct. 1997), pp. 78–91.
[18] Gantz, J., and Reinsel, D. As the economy contracts, the digital universe expands. http://www.emc.com/digitaluniverse, May 2009.
[19] Gantz, J. F., Chute, C., Manfrediz, A., Minton, S., Reinsel, D., Schlichting, W., and Toncheva, A. The diverse and exploding digital universe: An updated forecast of worldwide information growth through 2011. IDC white paper, sponsored by EMC, Mar. 2008.
[20] Gifford, D. K., Jouvelot, P., Sheldon, M. A., and O'Toole, Jr., J. W. Semantic file systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91) (Oct. 1991), ACM, pp. 16–25.
[21] Gopal, B., and Manber, U. Integrating content-based access mechanisms with hierarchical file systems. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI) (Feb. 1999), pp. 265–278.
[22] Grimm, R., Swift, M., and Levy, H. Revisiting structured storage: A transactional record store. Tech. Rep. UW-CSE-00-04-01, University of Washington, Department of Computer Science and Engineering, Apr. 2000.
[23] Halevy, A., Franklin, M., and Maier, D. Principles of dataspace systems. In PODS '06: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (New York, NY, USA, 2006), ACM, pp. 1–9.
[24] Holland, D. A., Braun, U., Maclean, D., Muniswamy-Reddy, K.-K., and Seltzer, M. Choosing a data model and query language for provenance. In 2nd International Provenance and Annotation Workshop (IPAW '08) (June 2008).
[25] Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), ACM, pp. 59–72.
[26] Kaplan, I. L., Abdulla, G. M., Brugger, S. T., and Kohn, S. R. Implementing graph pattern queries on a relational database. Tech. Rep. LLNL-TR-400310, Lawrence Livermore National Laboratory, January 2009.
[27] Leung, A., Shao, M., Bisson, T., Pasupathy, S., and Miller, E. L. Spyglass: Fast, scalable metadata search for large-scale storage systems. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST) (Feb. 2009), pp. 153–166.
[28] Mogul, J. C. Representing information about files. Tech. Rep. 86-1103, Stanford Univ. Department of CS, Mar. 1986. Ph.D. thesis.
[29] MSDN. Indexing Service. http://msdn.microsoft.com/en-us/library/aa163263.aspx, 2008.
[30] Muniswamy-Reddy, K.-K., Holland, D. A., Braun, U., and Seltzer, M. I. Provenance-aware storage systems. In Proceedings of the 2006 USENIX Annual Technical Conference (2006), pp. 43–56.
[31] Neuman, B. C. The Prospero file system: A global file system based on the virtual system model. Computing Systems 5, 4 (1992), 407–432.
[32] Olson, M. A. The design and implementation of the Inversion file system. In Proceedings of the Winter 1993 USENIX Technical Conference (San Diego, CA, USA, Jan. 1993), pp. 205–217.
[33] Oracle. Oracle Berkeley DB. http://www.oracle.com/database/docs/berkeley-db-datasheet.pdf, 2006.
[34] Padioleau, Y., and Ridoux, O. A logic file system. In Proceedings of the 2003 USENIX Annual Technical Conference (San Antonio, TX, June 2003), pp. 99–112.
[35] Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In SIGMOD '09 (2009), ACM.
[36] Salmon, B., Schlosser, S. W., Cranor, L. F., and Ganger, G. R. Perspective: Semantic data management for the home. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST '09) (2009), M. I. Seltzer and R. Wheeler, Eds., USENIX, pp. 167–182.
[37] Sechrest, S., and McClennen, M. Blending hierarchical and attribute-based file naming. In Proceedings of the 12th International Conference on Distributed Computing Systems (ICDCS '92) (Yokohama, Japan, 1992), pp. 572–580.
[38] Simmhan, Y., Barga, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., Shipway, M., Szalay, A. S., Werner, S., and Heasley, J. GrayWulf: Scalable software architecture for data intensive computing. Hawaii International Conference on System Sciences (2009), 1–10.
[39] Sloan Digital Sky Survey. Web page. www.sdss.org, 2009.
[40] Szalay, A. S., Gray, J., Thakar, A. R., Kunszt, P. Z., Malik, T., Raddick, J., Stoughton, C., and Vandenberg, J. The SDSS SkyServer: Public access to the Sloan Digital Sky Server data. In SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2002), ACM, pp. 570–581.
[41] Szeredi, M. File System in User Space README. http://www.stillhq.com/extracted/fuse/README, 2003.
[42] W3C. XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20/, 2007.
[43] Wills, C. E., Giampaolo, D., and Mackovitch, M. Experience with an interactive attribute-based user information environment. In Proceedings of the Fourteenth Annual IEEE International Phoenix Conference on Computers and Communications (March 1995), pp. 359–365.
