Fine-Grained Provenance for Linear Algebra Operators
Zhepeng Yan, Val Tannen, Zachary G. Ives
University of Pennsylvania

Abstract

Provenance is well-understood for relational query operators. Increasingly, however, data analytics is incorporating operations expressed through linear algebra: machine learning operations, network centrality measures, and so on. In this paper, we study provenance information for matrix data and linear algebra operations. Our core technique builds upon provenance for aggregate queries and constructs a $K$-semialgebra. This approach tracks provenance by annotating matrix data and propagating these annotations through linear algebra operations. We investigate applications in matrix inversion and graph analysis.

1. Introduction

For many years, data provenance was modeled purely as graphs capturing dataflows among “black box” operations: this representation is highly flexible and general, and has ultimately led to the PROV-DM standard. However, the database community has shown that fine-grained provenance with “white-box” operations [3] (specifically, for relational algebra operations) enables richer reasoning about the relationship between derived results and data provenance. The most general model for relational algebra queries, provenance semirings [7, 1], ensures that equivalent relational algebra expressions, as produced by a query optimizer, have equivalent provenance encodings. Moreover, there is a natural means of computing trust [6] and authoritativeness [16] scores for data, using provenance semiring expressions. Provenance can even be used to support incremental updates to views [6]. Finally, there are probabilistic mechanisms for learning weights or authoritativeness scores for different provenance tokens in the semiring model [16].

A natural question is whether the provenance semiring approach, to this point limited to variations on relational queries, can be extended to a greater subset of data analytics tasks, e.g., to other “bulk operations” over collections of data. We present a preliminary set of provenance primitives for one such class of operations: those from linear algebra. Matrices and linear algebra are heavily used in many data analytics applications, including image detection, signal processing, network science, and machine learning. While provenance libraries have been developed for matrix operations [17], no techniques currently exist for automatically extracting provenance from the various operations, and for reasoning based on this provenance. Our work in this paper develops primitives for matrix operations such as addition, multiplication and inversion. We show building blocks towards applications in principal component analysis (PCA), frequently used for dimensionality reduction, and in computing PageRank-style scores over network graphs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. TaPP 2016, June 8–9, 2016, Washington, DC. Copyright remains with the owner/author(s).

2. Tracking matrix data

Consider a $d \times n$ matrix $A$, and, as is often the case in machine learning, consider an algorithm that begins by computing $AA^T$. In a typical situation $A$ records $n$ samples from a $d$-dimensional feature space. Suppose we wish to track the provenance of the data by partitioning both the features and samples. In a very simple example the features might be of two types and we associate the provenance tokens $x$, respectively $y$, with these types of features. Suppose also that the samples come from two sources and we associate the provenance tokens $u$, respectively $v$, with these sources. Again for simplicity, we assume that all the features associated with $x$ correspond to adjacent rows, and similarly for $y$, $u$, $v$. This partitions $A$ into four submatrices $B$, $C$, $D$, $E$, see Figure 1.

[Figure 1. Provenance partitioning and its selectors. The figure shows $A$ with rows labeled $x$, $y$ and columns labeled $u$, $v$, partitioned into the blocks $B$, $C$ (rows $x$) and $D$, $E$ (rows $y$). It also shows the selectors: $S_x$ is $I$ stacked above $0$, $S_y$ is $0$ stacked above $I$, $T_u$ is $I$ followed by $0$, and $T_v$ is $0$ followed by $I$, where $I$ = identity matrix and $0$ = zero matrix.]

Let $x, y, u, v$ be from some set $X$ of provenance tokens and let $\mathbb{N}[X]$ be the provenance semiring of polynomials developed in [7]. Now we wish to annotate the submatrix $B$ with the polynomial $xu$ corresponding to its source and feature subset, and similarly $C$ with $xv$, $D$ with $yu$, and $E$ with $yv$. From this, we will later want to track which sources and feature subsets were used to produce which portions of an output matrix. Observe that

$$A = S_x B\, T_u + S_x C\, T_v + S_y D\, T_u + S_y E\, T_v$$

with appropriate selectors $S_x, S_y, T_u, T_v$. Selectors (see Section 4) are certain matrices whose elements are 0 or 1, and serve like a “mask” specifying entries to exclude or include, respectively. In our case, let $p$, respectively $q$, be the number of rows of provenance $x$, respectively of provenance $y$, where $p + q = d$. Let also $k$, respectively $l$, be the number of columns of provenance $u$, respectively of provenance $v$, where $k + l = n$. The dimensions of the selectors are $S_x : d \times p$, $S_y : d \times q$, $T_u : k \times n$, $T_v : l \times n$. Each of them consists of an identity matrix adjacent to a zero matrix, see Figure 1. Note that the selectors do not contain data that is subject to provenance tracking. We used $x, y, u, v$ as subscripts in the notation to make it easier to connect the formulas to the figure.

In this form we can annotate $B$, etc. with provenance polynomials. Indeed, let $\mathcal{M}$ be the many-sorted¹ ring of matrices of real numbers. In Section 3 we show how to embed $\mathcal{M}$ into a semialgebra of matrices annotated with polynomials, namely the tensor product $\mathbb{N}[X] \otimes \mathcal{M}$. Indeed, $B$ corresponds to $1 \otimes B \in \mathbb{N}[X] \otimes \mathcal{M}$ and annotation with $xu \in \mathbb{N}[X]$ corresponds to multiplication with scalars: $xu * (1 \otimes B) = xu \otimes B$. In turn, the selectors, since they do not contain data that needs to be tracked, remain with the neutral annotation, e.g., $1 \otimes S_x$. For simplicity we denote $1 \otimes M$ with $M$ for any matrix $M$. Hence

$$A = S_x (xu * B)\, T_u + S_x (xv * C)\, T_v + S_y (yu * D)\, T_u + S_y (yv * E)\, T_v$$

The decomposition into submatrices using selectors is nicely compatible with transposition. Indeed we have:

$$
\begin{aligned}
A^T ={} & T_u^T (xu * B^T)\, S_x^T + T_u^T (yu * D^T)\, S_y^T \\
 & + T_v^T (xv * C^T)\, S_x^T + T_v^T (yv * E^T)\, S_y^T
\end{aligned}
$$

Now to multiply $AA^T$ we can (as will be justified in Section 3) use any identity that is valid in matrix algebra (associativity, distributivity, etc.). We multiply two sums of four products each, which gives a sum of sixteen products. Let's just look at two of them:

$$S_x (xu * B)\, T_u T_u^T (yu * D^T)\, S_y^T \qquad S_x (xv * C)\, T_v T_v^T (yv * E^T)\, S_y^T$$

We will show in Section 4 that $T_u T_u^T$ and $T_v T_v^T$ are actually identity matrices. Moreover, it will follow from Section 3 that

$$(xu * B)(yu * D^T) = xyu^2 * BD^T \qquad (xv * C)(yv * E^T) = xyv^2 * CE^T$$

With these observations it has become clear that

$$
\begin{aligned}
AA^T ={} & S_x (x^2u^2 * BB^T + x^2v^2 * CC^T)\, S_x^T \\
 & + S_x (xyu^2 * BD^T + xyv^2 * CE^T)\, S_y^T \\
 & + \cdots
\end{aligned}
$$

Although in Section 2 we only made use of embedding the ring of matrices into a $\mathbb{N}[X]$-semialgebra, the construction works more generally and this generality might prove useful in future research projects that use different provenance semirings, as in [6].

Let $(K, +_K, \cdot_K, 0_K, 1_K)$ be a commutative semiring. In previous work [1] it was shown, using a tensor product-like construction, how to embed aggregation domains (specifically, commutative monoids) into “most economical” $K$-semimodules that allow us to track the provenance captured by the elements of $K$. Here we consider matrices, a more complicated aggregation domain, since it has two operations ($+$ and $\cdot$), so we aim for a richer algebraic structure, that of (many-sorted) semialgebra. We shall embed $\mathcal{M}$ into a structure that, like $\mathcal{M}$, forms a ring $(K \otimes \mathcal{M}, +, \cdot, 0, I)$ (for simplicity we use for the operations the same notation we used for $\mathcal{M}$). In addition, $K \otimes \mathcal{M}$ forms a $K$-semialgebra, i.e., it also has $*$, a binary operation $K \times (K \otimes \mathcal{M}) \to K \otimes \mathcal{M}$ that satisfies the following algebraic identities.

DEFINITION 1. $\forall\, k, k_1, k_2 \in K \;\; \forall\, A, A_1, A_2 \in K \otimes \mathcal{M}$

$$
\begin{aligned}
k * (A_1 + A_2) &= k * A_1 + k * A_2 && (1) \\
k * 0 &= 0 && (2) \\
(k_1 +_K k_2) * A &= k_1 * A + k_2 * A && (3) \\
0_K * A &= 0 && (4) \\
(k_1 \cdot_K k_2) * A &= k_1 * (k_2 * A) && (5) \\
1_K * A &= A && (6) \\
(k_1 * A_1)(k_2 * A_2) &= (k_1 \cdot_K k_2) * (A_1 A_2) && (7)
\end{aligned}
$$

(In particular, $(K \otimes \mathcal{M}, +, 0, *)$ is a (many-sorted) $K$-semimodule.)

In Section 2 we used the identity $(xu * B)(yu * D^T) = xyu^2 * BD^T$. Indeed, this is an instance of semialgebra axiom (7) above. We proceed to sketch the construction of $K \otimes \mathcal{M}$.

Tensor product construction. We start with $K \times \mathcal{M}$, sorted just like $\mathcal{M}$, denote its elements $k \otimes A$ instead of $\langle k, A \rangle$ and call them “simple tensors”. Next we consider the set $\mathrm{Bag}(K \times \mathcal{M})$ of finite bags of such simple tensors, i.e., functions $f : K \times \mathcal{M} \to \mathbb{N}$ whose support, i.e., $\mathrm{supp}(f) = \{ k \otimes A \in K \times \mathcal{M} \mid f(k \otimes A) \neq 0 \}$, is finite. Let $\uplus$ be the usual bag union and $\emptyset$ be the empty bag.
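The selector decomposition of Section 2 and semialgebra axiom (7) can be checked on a small concrete instance. The following is a minimal sketch in plain Python, not the paper's implementation: it assumes a monomial is stored as a map from provenance token to exponent, and a simple tensor as a (monomial, matrix) pair. The helper names `matmul`, `madd`, `transpose`, `sandwich`, and `tensor_mul` are hypothetical and introduced only for illustration, as is the choice of dimensions ($d = 2$, $n = 3$, $p = q = 1$, $k = 2$, $l = 1$).

```python
from collections import Counter

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def madd(P, Q):
    """Add two matrices of the same shape entry by entry."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def transpose(P):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*P)]

def sandwich(S, M, T):
    """Compute S M T, the shape used by the selector decomposition."""
    return matmul(matmul(S, M), T)

def tensor_mul(s, t):
    """Axiom (7) on simple tensors: (k1 * A1)(k2 * A2) = (k1 k2) * (A1 A2).
    Monomials are Counters, so multiplying them adds exponents."""
    (k1, A1), (k2, A2) = s, t
    return (k1 + k2, matmul(A1, A2))

# Illustrative data (an assumption, not from the paper).
A = [[1, 2, 5],
     [3, 4, 6]]
B, C, D, E = [[1, 2]], [[5]], [[3, 4]], [[6]]

# Selectors: an identity block next to a zero block, as in Figure 1.
Sx, Sy = [[1], [0]], [[0], [1]]          # d x p and d x q
Tu, Tv = [[1, 0, 0], [0, 1, 0]], [[0, 0, 1]]  # k x n and l x n

# A = Sx B Tu + Sx C Tv + Sy D Tu + Sy E Tv
rebuilt = madd(madd(sandwich(Sx, B, Tu), sandwich(Sx, C, Tv)),
               madd(sandwich(Sy, D, Tu), sandwich(Sy, E, Tv)))

# Tu Tu^T is an identity matrix; Tu Tv^T is zero, so mixed u/v products vanish.
tu_tut = matmul(Tu, transpose(Tu))
tu_tvt = matmul(Tu, transpose(Tv))

# (xu * B)(yu * D^T) = xyu^2 * BD^T
xu = Counter(x=1, u=1)
yu = Counter(y=1, u=1)
annotation, matrix = tensor_mul((xu, B), (yu, transpose(D)))

print(rebuilt == A, tu_tut, tu_tvt, dict(annotation), matrix)
```

In this representation the annotation of the product is the monomial $xyu^2$ and the matrix part is the ordinary product $BD^T$, which is how the identity is used in Section 2. Sums of simple tensors (bags) are deliberately left out of the sketch.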