Provenance for Aggregate Queries
Total Page:16
File Type:pdf, Size:1020Kb
Provenance for Aggregate Queries Yael Amsterdamer Daniel Deutch Val Tannen University of Pennsylvania University of Pennsylvania University of Pennsylvania and Tel Aviv University and Ben Gurion University ABSTRACT R We study in this paper provenance information for queries EmpId Dept Sal with aggregation. Provenance information was studied in the 1 d1 20 p1 Dept context of various query languages that do not allow for ag- 2 d1 10 p2 d1 p1 + p2 + p3 gregation, and recent work has suggested to capture prove- 3 d1 15 p3 d2 r1 + r2 nance by annotating the different database tuples with ele- 4 d2 10 r1 ments of a commutative semiring and propagating the anno- 5 d2 15 r2 (b) tations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inap- (a) plicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but Figure 1: Projection on annotated relations also the individual values within tuples, using provenance to ate the polynomials in the security semiring, and propagate describe the values computation. We realize this approach the security annotations through query evaluation (see Sec- in a concrete construction, first for “simple” queries where tion 2.1), assigning security levels to query results. the aggregation operator is the last one applied, and then for Let us briefly illustrate deletion propagation as an appli- arbitrary (positive) relational algebra queries with aggrega- cation of provenance. Consider a simple example of an em- tion; the latter queries are shown to be more challenging in ployee/department/salary relation R shown in Figure 1(a). this context. Finally, we use aggregation to encode queries The variables p1,p2,p3,r1,r2 can be thought of as tuple with difference, and study the semantics obtained for such identifiers and in the framework of provenance polynomi- queries on provenance annotated databases. als [24] they are the “provenance tokens” or “indeterminates” out of which provenance is built. We denote by N[X] the 1. INTRODUCTION set of provenance polynomials (here X = {p1,p2,p3,r1,r2}). R can be seen as an N[X]-annotated relation; as defined in The annotation of the results of database transformations [24] the evaluation of query, for example ΠDept R, produces with provenance information has quite a few applications another N[X]-annotated relation, in this example the one [19, 6, 35, 10, 11, 38, 34, 25, 27, 39, 37, 2, 4]. Recent shown in Figure 1(b). Intuitively, in this simple example, work [24, 17, 21] has proposed a framework of semiring an- the summation in the annotation of every result tuple is notations that allows us to state formally what is expected of over the identifiers of its alternative origins 1. such provenance information. These papers have developed Now, the result of propagating the deletions of tuples with the framework for the positive fragment of the relational EmpId 3 and 5 in R is obtained by simply setting p3 = arXiv:1101.1110v1 [cs.DB] 5 Jan 2011 algebra (as well as for Datalog, the positive Nested Rela- r2 = 0 in the answer. We get the same two tuples in the tional Calculus, and some query languages on trees/XML). query answer but their provenances change to p1 + p2 and The main goal of this paper is to extend the framework to r1, respectively. If the tuple with EmpId 4 is also deleted aggregate operations. from R then we also set r1 = 0, and the second tuple in the In the perspective promoted by these papers, provenance answer is deleted because its provenance has now become is a general form of annotation information that can be spe- 0. This algebraic treatment of deletions is related to the cialized for different purposes, such as multiplicity, trust, counting algorithm for view maintenance [26], but is more cost, security, or identification of “possible worlds” which general as it incrementally maintains not just the data but in turn applies to incomplete databases, deletion propaga- also the provenance. tion, and probabilistic databases. In fact, the introduction An intuitive way of understanding what happens is that of the framework in [24] was motivated by the need to track provenance-aware evaluation of queries conveniently “com- trust and deletion propagation in the Orchestra system [23]. mutes” with deletions. In fact, in [24, 17] this intuition is What makes such a diversity of applications possible is that captured formally by theorems that state that query eval- each is captured by a different semiring, while provenance is uation commutes with semiring homomorphisms. The fac- represented by elements of a semiring of polynomials. One torization through provenance relies on this and on the fact then relies on the property that any semiring-annotation that the polynomial provenance semiring is “freely gener- semantics factors through the provenance polynomials se- ated”. All applications of provenance polynomials we have mantics. This means that storing provenance polynomials listed, for trust, security, etc., are based on these theorems. allows for many other practical applications. For example, to capture access control, where the access to different tuples 1we explain how the annotations of query results are com- require different security credentials, we can simply evalu- puted in Section 2.1 Dept SalMass rather than multiplicities? Then, the aggregate value does d1 45 p1p2p3 not correspond to any number. d1 30 p1p2p3 Dept SalMass d d We will make p1 ×20 into an abstract construction, p1 ⊗20 1 35 p1p2pb3 1 30 p1p2 d d and the aggregate value will be the formal expression p1 ⊗ 1 25 p1pb2p3 1 20 p1p2 d d 20 + p2 ⊗ 10 + p3 ⊗ 15. 1 20 pb1p2p3 1 10 p1pb2 d Intuitively, we are embedding the domain of sum-aggregates, 1 10 p1pb2pb3 ··· ··· ·b · · d i.e., the reals R, into a larger domain of formal expres- 1 15 pb1p2pb3 ··· ··· · · · (b) sions that capture how the sum-aggregates are computed b b from values annotated with provenance. We do the same (a) for other kinds of aggregation, for instance min-aggregation gives p1 ⊗20 min p2 ⊗10 min p3 ⊗15. We call these annotated Figure 2: A naive approach to aggregation aggregate expressions. In this paper we consider only aggregations defined by Thus, commutation with homomorphisms is an essential commutative-associative operations 3. Specifically, our frame- criterion for our proposed framework extension to aggregate work can accommodate aggregation based on any commuta- operations. However, in Section 3.1 we prove that the frame- tive monoid. For example the commutative monoid for sum- work of semiring-annotated relations introduced in [24] can- mation is (R, +, 0) while the one for min is (R∞, min, ∞)4. not be extended to handle aggregation while both satisfying To combine an aggregation monoid M with an annotation commutation with homomorphisms and working as usual on commutative semiring K, in a way capturing aggregates over set or bag relations. K-relations, we propose the use of the algebraic structure of If the semiring operations are not enough then perhaps K-semimodules (see Section 2.2). Semimodules are a way of we can add others? This is a natural idea so we illustrate it generalizing (a lot!) the operations considered in linear alge- on the same R in Figure 1(a) and we use again the neces- bra. Its “vectors” form only a commutative monoid (rather sity to support deletion propagation to guide the approach. than an abelian group) and its “scalars” are the elements Consider the query that groups by Dept and sums Sal. The of K which is only a commutative semiring (rather than a result of the summation depends on which tuples partici- field). pate in it. To provide enough information to obtain all the In general, a commutative monoid M does not have an possible summation results for all possible sets of deletions, obvious structure of K-semimodule. To make it such we we could use the representation in Figure 2(a) where we add may need to add new elements corresponding to the scalar to the semiring operations an unary operation with the multiplication of elements of M with elements from K, thus property that p = 1 whenever p = 0. This will indeed sat- b ending up with the formal expressions that represent aggre- isfy the deletion criterion. For example when the tuple with b gate computations, motivated above, as elements of a tensor Id 3 is deleted we get the relation in Figure 2(b). In fact, product construction K ⊗M. We show that the use of tensor there exist semirings with the additional structure needed to product expressions as a formal representation of aggrega- define . For example in the semiring of polynomials with tion result is effective in managing the provenance of “simple integer coefficients, Z[X], we can take p = 1 − p while in b aggregate” queries, namely queries where the aggregation the semiring of boolean expressions with variables from X, b operators are the last ones applied. BoolExp(X), we can take p = ¬p 2. The latter is essentially We show that certain semirings are “compatible” with the approach taken in [31]. However, whether we use Z[X] b certain monoids, in the sense that the results of compu- or BoolExp(X), we still have, in the worst case, exponen- tation done in K ⊗ M may be mapped to M, faithfully tially many different results to account for, at least in the representing the aggregation results. Interestingly, compat- case of summation (a lower bound recognized also in [31]). It ibility is aligned with common wisdom: it is known that follows that summation in particular (and therefore any uni- some (idempotent) aggregation functions such as MIN and form approach to aggregation) cannot be represented with MAX work well for set relations, while SUM and PROD re- a feasible amount of annotation as long as the annotation quire the treatment of relations as bags.