Story Book: An Efficient Extensible Provenance Framework

R. Spillane, R. Sears∗, C. Yalamanchili, S. Gaikwad, M. Chinni, and E. Zadok
Stony Brook University and ∗University of California, Berkeley

Abstract

Most application provenance systems are hard coded for a particular type of system or data, while current provenance file systems maintain in-memory provenance graphs and reside in kernel space, leading to complex and constrained implementations. Story Book resides in user space, and treats provenance events as a generic event log, leading to a simple, flexible and easily optimized system.

We demonstrate the flexibility of our design by adding provenance to a number of different systems, including a file system, database and a number of file types, and by implementing two separate storage backends. Although Story Book is nearly 2.5 times slower than ext3 under worst case workloads, this is mostly due to FUSE message passing overhead. Our experiments show that coupling our simple design with existing storage optimizations provides higher throughput than existing systems.

1 Introduction

Existing provenance systems are designed to deal with specific applications and system architectures, and are difficult to adapt to other systems and types of data. Story Book decouples the application-specific aspects of provenance tracking from dependency tracking, queries and other mechanisms common across provenance systems.

Story Book runs in user space, simplifying its implementation, and allowing it to make use of existing, off-the-shelf components. Implementing a provenance (or any other) file system in user space incurs significant overhead. However, our user space design significantly reduces communication costs for Story Book’s application-specific provenance extensions, which may use existing IPC mechanisms, or even run inside the process generating provenance data.

Story Book supports application-specific extensions, allowing it to bring provenance to new classes of systems. It currently supports file system and database provenance and can be extended to other types of systems, such as Web or email servers. Story Book’s file system provenance system also supports extensions that record additional information based on file type. We have implemented two file-type modules: One records the changes made by applications to .txt files, and the other records modifications to .docx metadata, such as the author name.

Our experiments show that Story Book’s storage and query performance are adequate for file system provenance workloads where each process performs a handful of provenance-related actions. However, systems that service many small requests across a wide range of users, such as databases and Web servers, generate provenance information at much higher rates and with finer granularity. As such systems handle most of today’s multi-user workloads, we believe they also provide higher-value provenance information than interactive desktop systems. This space is where Story Book’s extensible user-space architecture and high write throughput are most valuable.

The rest of this paper is organized as follows: Section 2 describes existing approaches to provenance, Section 3 describes the design and implementation of Story Book, and Section 4 describes our experiments and performance comparisons with PASSv2. We conclude in Section 5.

2 Background

A survey of provenance systems [14] provides a taxonomy of existing provenance databases. A survey of the taxonomy shows that Story Book is flexible enough to cover much of the design space targeted by existing systems. This flexibility comes from Story Book’s layered, modular approach to provenance tracking (Figure 1).

A provenance source intercepts user interaction events with application data and sends these events to application specific extensions which interpret them and generate provenance inserts into one of Story Book’s storage backends. Queries are handled by the Story Book API. Story Book relies on external libraries to implement each of its modules. FUSE [8] and MySQL [18] intercept file system and database events. Extensions to Story Book’s FUSE file system, such as the .txt and .docx modules, annotate events with application-specific provenance. These provenance records are then inserted into either Stasis [12] or Berkeley DB [15]. Stasis stores provenance data using database-style no-Force/Steal recovery for its hashtables, and compressed log structured merge (LSM) trees [11] for its Rose [13] indexes. Story Book utilizes Valor [16] to maintain write-ordering between its logs and the kernel page cache. This reduces the number of disk flushes, greatly improving performance.

Figure 1: Story Book’s modular approach to provenance tracking.

Although Story Book utilizes a modular design that separates different aspects of its operation into external modules (i.e., Rose, Valor, and FUSE), it does hardcode some aspects of its implementation. Simhan’s survey discusses provenance systems that store metadata separately from application data, and provenance systems that do not. One design decision hardcoded into Story Book is that metadata is always stored separately from application data for performance reasons. Story Book stores blob data in a physically separate location on disk in order to preserve locality of provenance graph nodes and efficiently support queries.

Story Book avoids hardcoding most of the other design decisions mentioned by the survey. Because Story Book does not directly support versioning, application-specific provenance systems are able to decide whether to store logical undo/redo records of application operations to save space and improve performance, or to save raw value records to reduce implementation time. Story Book’s .txt module stores patches between text files rather than whole blocks or pages that were changed.

Story Book allows applications to determine the granularity of their provenance records, and whether to focus on information about data, processes or both. Similarly, Story Book’s implementation is independent of the format of extended records, allowing applications to use metadata standards such as RDF [19] or Dublin Core [2] as they see fit.

The primary limitation of Story Book compared to systems that make native use of semantic metadata is that Story Book’s built-in provenance queries would need to be extended to understand annotations such as “version-of” or “derived-from,” and act accordingly. Applications could implement custom queries that make use of these attributes, but such queries would incur significant I/O cost. This is because our schema stores (potentially large) application-specific annotations in a separate physical location on disk. Of course, user-defined annotations could be stored in the provenance graph, but this would increase complexity, both in the schema, and in provenance query implementations.

Databases such as Trio [20] reason about the quality or reliability of input data and processes that modify it over time. Such systems typically cope with provenance information, as unreliable inputs reflect uncertainty in the outputs of queries. Story Book targets provenance over exact data. This is especially appropriate for file system provenance and regulatory compliance applications; in such circumstances, it is more natural to ask which version of an input was used rather than to calculate the probability that a particular input is correct.

2.1 Comparison to PASS

Existing provenance systems couple general provenance concepts to system architectures, leading to complex in-kernel mechanisms and complicating integration with application data representations. Although PASSv2 [9] suffers from these issues, it is also the closest system to Story Book (Figure 2).

Figure 2: PASSv2’s monolithic approach to provenance tracking. Although Story Book contains more components than PASSv2, Story Book addresses a superset of PASSv2’s applications, and leverages existing systems to avoid complex, provenance-specific code.

PASSv2 distinguishes between data attributes, such as name or creation time, and relationships indicating things like data flow and versioning. Story Book performs a similar layering, except that it treats versioning information (if any) as an attribute. Furthermore, whereas PASSv2 places such mechanisms in kernel space, Story Book places them in user space.

On the one hand, this means that user-level provenance inspectors could tamper with provenance information; on the other it has significant performance and usability advantages. In regulatory environments (which cope with untrusted end-users), Story Book should run in a trusted file or other application server. The security [...]

[...] is important when tracking database provenance, as it allows Story Book to reuse existing durability mechanisms, such as Valor, and inexpensive database replication techniques. For our experiments, we extended write ahead provenance with some optimizations employed by Fable so that PASSv2’s [...]
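
The layered flow described in Section 2 (a provenance source emits events, an application-specific extension annotates them, and a storage backend receives the resulting inserts) can be illustrated with a minimal Python sketch. This is not Story Book's code or API: every name below (Event, Backend, txt_extension, record) is hypothetical, an in-memory object stands in for Stasis or Berkeley DB, and Python's difflib is assumed as a stand-in for whatever patch format the .txt module actually uses. The sketch only mirrors two ideas from the text: annotations are kept apart from the provenance records, and the .txt extension stores a patch rather than the full contents.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Event:
    """A raw event from a provenance source (hypothetical shape)."""
    process: str        # process that performed the action
    path: str           # object that was touched
    action: str         # e.g. "write"
    old_text: str = ""  # contents before the action, if known
    new_text: str = ""  # contents after the action, if known


@dataclass
class Backend:
    """Stand-in backend: small records kept apart from bulky annotations."""
    edges: list = field(default_factory=list)        # (process, action, path)
    annotations: dict = field(default_factory=dict)  # edge index -> blob

    def insert(self, edge, blob=None):
        self.edges.append(edge)
        if blob is not None:
            self.annotations[len(self.edges) - 1] = blob

    def writers_of(self, path):
        """Toy query: which processes wrote to `path`?"""
        return [proc for (proc, action, target) in self.edges
                if target == path and action == "write"]


def txt_extension(event):
    """Annotate writes to .txt files with a unified diff, not full contents."""
    if not event.path.endswith(".txt") or event.action != "write":
        return None
    diff = difflib.unified_diff(
        event.old_text.splitlines(keepends=True),
        event.new_text.splitlines(keepends=True),
        fromfile="before", tofile="after",
    )
    return "".join(diff)


def record(event, backend, extensions):
    """Pass one event through the extensions, then insert it into the backend."""
    blob = None
    for ext in extensions:
        blob = ext(event)
        if blob is not None:
            break
    backend.insert((event.process, event.action, event.path), blob)


backend = Backend()
record(Event("editor", "notes.txt", "write", "a\nb\n", "a\nc\n"),
       backend, [txt_extension])
record(Event("cp", "photo.jpg", "write"), backend, [txt_extension])
print(backend.writers_of("notes.txt"))
print(backend.annotations[0])
```

In this sketch the query in writers_of touches only the small edge records, which loosely reflects why the paper stores large annotations in a separate location; a query that needed the patches themselves would have to read the separate annotations store as well.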
