A Generic Provenance Middleware for Database Queries, Updates, and Transactions

A Generic Provenance Middleware for Database Queries, Updates, and Transactions Bahareh Sadat Arab Dieter Gawlick Venkatesh Radhakrishnan Illinois Institute of Technology Oracle Corporation Oracle Corporation [email protected] [email protected] [email protected] Hao Guo Boris Glavic Illinois Institute of Technology Illinois Institute of Technology [email protected] [email protected] Abstract lational encoding of provenance annotations. These systems apply We present an architecture and prototype implementation for a query rewrite techniques to transform a query q into a query that generic provenance database middleware (GProM) that is based propagates input annotations to produce the result of q annotated on the concept of query rewrites, which are applied to an algebraic with provenance. This approach has many advantages. It benefits graph representation of database operations. The system supports from existing database technology, e.g., provenance computations a wide range of provenance types and representations for queries, are optimized by the database optimizer. Queries over provenance updates, transactions, and operations spanning multiple transac- can be expressed as SQL queries over the relational encoding [11]. tions. GProM supports several strategies for provenance genera- Alternatively, we can compile a special-purpose provenance query tion, e.g., on-demand, rule-based, and “always on”. To the best of language into SQL queries over such an encoding [4, 14]. In spite our knowledge, we are the first to present a solution for comput- of the advances made, the current state-of-the-art falls short in sev- ing the provenance of concurrent database transactions. Our solu- eral aspects: tion can retroactively trace transaction provenance as long as an • Even though provenance of relational updates (we use update audit log and time travel functionality are available (both are sup- as an umbrella term for DML operations) is relatively well un- ported by most DBMS). Other noteworthy features of GProM in- derstood [2, 5, 15, 17], no comprehensive implementation exists. clude: extensibility through a declarative rewrite rule specification Furthermore, no solution for tracking transactions has been pro- language, support for multiple database backends, and an optimizer posed so far. for rewritten queries. • Systems are inflexible in their support for deciding when to com- Categories and Subject Descriptors H.2.4 [Relational databases] pute provenance, when to store it, and how to store it. For example, Trio [1], DBNotes [4], and ORCHESTRA [15] compute and gen- Keywords Provenance, Databases, Query Rewrite, Transactions erate provenance for all operations and systems such as Perm [11] do not compute any provenance unless it is explicitly requested. 1. Introduction • Queries produced by query rewrites use atypical access patterns Provenance tracking for database operations, i.e., automatically and operator sequences which often leads to poor execution plans, collecting and managing information about the origin of data, has even for database systems with sophisticated optimizers. received considerable interest from the database community in the • Most systems only support one type of provenance using one last decade. Efficiently generating and querying provenance is es- particular representation of provenance. sential for debugging data and queries, evaluating trust measures for data, defining new types of access control models, auditing, In this work, we present our vision and a prototype imple- and as a supporting technology for data integration and proba- mentation for GProM, a generic provenance middleware, that ad- bilistic databases. The de-facto standard for database provenance dresses the aforementioned problems. We use annotation propaga- is to model provenance as annotations on data and compute the tion and query rewrite techniques for computing, querying, storing, provenance for the outputs of an operation by propagating annota- and translating the provenance of SQL queries, updates, transactions [1, 4, 11, 13, 15]. Many provenance systems [11] use a re- tions, and across transactions. To the best of our knowledge, we are the first to present an approach for tracking the provenance of concurrent transactions (our prototype supports snapshot isolation [3] 1 Permission to make digital or hard copies of all or part of this work for personal or and statement snapshot isolation ). We will support a wide vari- classroom use is granted without fee provided that copies are not made or distributed ety of database backends from mature DBMS to SQL processors for profit or commercial advantage and that copies bear this notice and the full citation running over distributed analytics platforms. on the first page. Copyrights for components of this work owned by others than ACM We first present an overview of the system in Section 2 and must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a then describe our main contribution, provenance computation for fee. Request permissions from [email protected]. updates and transactions, in Section 3. The prototype implementa- CONF ’yy, Month d–d, 20yy, City, ST, Country. tion of GProM will be presented in Section 4. Appendix A shows Copyright c 20yy ACM 978-1-nnnn-nnnn-n/yy/mm. $15.00. http://dx.doi.org/10.1145/nnnnnnn.nnnnnnn 1 To be explained later tionality is sufficient for computing the provenance of past queries using simple modifications of standard provenance rewrites [7, 18]. Our main contribution is to demonstrate that this is also sufficient PROVENANCE OF Policy RSL (SELECT * FROM ... User for tracking the provenance of updates and transactions. If the user requests provenance for a transaction T , the transaction reenac- 1ParserParser GProM 4 8 Parser Provenance RSL tor 3 extracts the list of SQL statements executed by T from the Storage Manager Manager audit log and constructs a reenactment query q(T ) that simulates Provenance RewriterProvenance the effects of these statements. We use the provenance rewriter to ProvenanceRewriter 2 + RewriterProvenance rewrite q(T ) into a query q(T ) that computes the provenance Rewriter Policy Policy 3 PolicyPolicy PolicyRSL Transaction of the reenacted transaction. Note that the construction of q(T ) Reenactor is independent of the provenance2 rewrite and q(T ) is a standard relational query. Using this approach, we can compute any type 5 of provenance for updates, transactions, and across transactions as Provenance Translator long as rewrite rules for computing the provenance of queries have been implemented for this provenance type. SQL Code 6 7GeneratorSQL Code Optimizer Provenance Generation and Storage Policies: As explained GeneratorSQL Code Generator above we can reconstruct provenance for any past query, update, or transaction using the audit log and time travel. Thus, explicit SELECT * FROM ... provenance storage is unnecessary. The default in GProM is to only compute provenance if it is explicitly requested. Nonetheless, we also support automatic provenance generation and storage for Query LogQuery LogVersionedQuery Log VersionedTablesQuery Log VersionedTables VersionedTables Tables Versioned Tables -- --- --- Audit Log -- --- --- -- --- --- -- --- --- use cases where retroactive reconstruction of provenance is not an -- -- --- -- -- - - - -- --- --- -- -- --- -- -- - - --- -- --- -- -- - - --- -- --- -- -- - - - ----- -- -- --- -- -- - - - ----- ----- ----- --- --- - --- option. The user can register provenance storage policies with the ----- --- --- - --- --- --- - --- --- --- - --- --- --- - --- storage manager 6 . These policies determine when and how to generate and store provenance. Figure 1: GProM Overview Optimizing Rewritten Queries: GProM will include an optimizer 7 which applies heuristic and cost-based rules to transform rewritten queries into SQL code that can be successfully optimized how GProM processes an example provenance request for a trans- by the underlying DBMS. This is necessary, because provenance action to translate it into SQL code. We present additional details rewrites generate queries with unusual access patterns and operator about the extensibility mechanism, database independence, prove- sequences. Even sophisticated database optimizers are not capable nance translations, provenance import/export, optimizer, and stor- of producing reasonable plans for such queries. age policies, in Appendices B, C, D, E, F, and G. Related work will be discussed inline where appropriate. Rewrite Extensibility: GProM relies on query rewrites for provenance computation, optimizations, transaction support, provenance storage, and translation between provenance representations. Thus, 2. System Overview the main bottleneck for extending the system with new function- Figure 1 shows an overview of GProM. The user interacts with the ality is implementing new rewrites. We will develop a declarative system using an extension of the underlying database system’s SQL rewrite specification language (RSL) for expressing rewrite rules dialect. Specifically, we support new language constructs for com- in a concise and easy to understand manner. User and system- puting and managing provenance (similar to Perm [11]). Incoming developer provided rewrites will be stored in a repository 8 . Rules statements are translated into a relational algebra graph represen- are applied to input queries using an interpreter

A Generic Provenance Middleware for Database Queries, Updates, and Transactions

Blockchain Database for a Cyber Security Learning System

An Introduction to Cloud Databases a Guide for Administrators

Middleware-Based Database Replication: the Gaps Between Theory and Practice

Database Technology for Bioinformatics from Information Retrieval to Knowledge Systems

What Is a Database? Differences Between the Internet and Library

Data Quality Requirements Analysis and Modeling December 1992 TDQM-92-03 Richard Y

DBOS: a Proposal for a Data-Centric Operating System

Multimedia Database Support for Digital Libraries

A Longitudinal Study of Database Usage Within a General Audience Digital Library Abstract 1. Introduction and Literature Revie

Edsger Wybe Dijkstra

Digital Library Resources

Oracle Database Upgrade: Quick Start Guide