Lineage Tracing for General Data Warehouse Transformations

Lineage Tracing for General Data Warehouse Transformations∗ Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University fcyw, [email protected] Abstract. Data warehousing systems integrate information and managing such transformations as part of the extract- from operational data sources into a central repository to enable transform-load (ETL) process, e.g., [Inf, Mic, PPD, Sag]. analysis and mining of the integrated information. During the The transformations may vary from simple algebraic op- integration process, source data typically undergoes a series of erations or aggregations to complex procedural code. transformations, which may vary from simple algebraic opera- In this paper we consider the problem of lineage trac- tions or aggregations to complex “data cleansing” procedures. ing for data warehouses created by general transforma- In a warehousing environment, the data lineage problem is that tions. Since we no longer have the luxury of a fixed set of of tracing warehouse data items back to the original source items operators or the algebraic properties offered by relational from which they were derived. We formally define the lineage views, the problem is considerably more difficult and tracing problem in the presence of general data warehouse trans- open-ended than previous work on lineage tracing. Fur- formations, and we present algorithms for lineage tracing in this thermore, since transformation graphs in real ETL pro- environment. Our tracing procedures take advantage of known cesses can often be quite complex—containing as many structure or properties of transformations when present, but also as 60 or more transformations—the storage requirements work in the absence of such information. Our results can be used and runtime overhead associated with lineage tracing are as the basis for a lineage tracing tool in a general warehousing very important considerations. setting, and also can guide the design of data warehouses that We develop an approach to lineage tracing for general enable efficient lineage tracing. transformations that takes advantage of known structure or properties of transformations when present, yet pro- 1 Introduction vides tracing facilities in the absence of such information Data warehousing systems integrate information from op- as well. Our tracing algorithms apply to single transfor- erational data sources into a central repository to enable mations, to linear sequences of transformations, and to analysis and mining of the integrated information [CD97, arbitrary acyclic transformation graphs. We present opti- LW95]. Sometimes during data analysis it is useful to mizations that effectively reduce the storage and runtime look not only at the information in the warehouse, but overhead in the case of large transformation graphs. Our also to investigate how certain warehouse information was results can be used as the basis for an in-depth data ware- derived from the sources. Tracing warehouse data items house analysis and debugging tool, by which analysts can back to the source data items from which they were de- browse their warehouse data, then trace back to the source rived is termed the data lineage problem [CWW00]. En- data that produced warehouse data items of interest. Our abling lineage tracing in a data warehousing environment results also can guide the design of data warehouses that has several benefits and applications, including in-depth enable efficient lineage tracing. data analysis and data mining, authorization management, The main contributions of our work are: view update, efficient warehouse recovery, and others as • In Sections 2 and 3 we define data transformations outlined in, e.g., [BB99, CWW00, HQGW93, LBM98, formally and identify a set of relevant transforma- LGMW00, RS98, RS99, WS97]. tion properties. We define data lineage for general In previous work [CW00, CWW00], we studied the warehouse transformations exhibiting these proper- warehouse data lineage problem in depth, but we only ties, but we also cover “black box” transformations considered warehouse data defined as relational materi- with no known properties. The transformation prop- alized views over the sources, i.e., views specified using erties we consider can be specified easily by transfor- SQL or relational algebra. Related work has focused on mation authors, and they encompass a large majority even simpler relational views [Sto75] or on multidimen- of transformations used for real data warehouses. sional views [DB2, Pow]. In real production data warehouses, however, data imported from the sources is gen- • In Section 3 we develop lineage tracing algorithms erally “cleansed”, integrated, and summarized through a for single transformations. Our algorithms take ad- sequence or graph of transformations, and many com- vantage of transformation properties when they are mercial warehousing systems provide tools for creating present, and we also suggest how indexes can be used to further improve tracing performance. ∗ *This work was supported by the National Science Foundation un- • In Section 4 and the full version of this paper der grants IIS-9811947 and IIS-9817799. [CW01] we develop a general algorithm for lineage Permission to copy without fee all or part of this material is granted pro- tracing through a sequence or graph of transforma- vided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication tions. Our algorithm includes methods for combin- and its date appear, and notice is given that copying is by permission of ing transformations so that we can reduce overall the Very Large Data Base Endowment. To copy otherwise, or to repub- tracing cost, including the number of transforma- lish, requires a fee and/or special permission from the Endowment. tions we must trace through and the number of in- Proceedings of the 27th VLDB Conference, termediate results that must be stored or recomputed Roma, Italy, 2001 for the purpose of lineage tracing. • We have implemented a prototype lineage tracing prod-id prod-name category price valid system based on our algorithms, and in the full ver- 111 Apple IMAC computer 1200 10/1/1998– 222 Sony VAIO computer 3280 9/1/1998–11/30/1998 sion of this paper [CW01] we present a few initial 222 Sony VAIO computer 2250 12/1/1998–9/30/1999 performance results. 222 Sony VAIO computer 1950 10/1/1999– 333 Canon A5 electronics 400 4/2/1999– For examples in this paper we use the relational data 444 Sony VAIO computer 2750 12/1/1998– model, but our approach and results clearly apply to data objects in general. Figure 1: Source data set for Product This version of the paper is considerably reduced from the original full version [CW01]. Readers in- order-id cust-id date prod-list terested in our complete treatment of the topic are 0101 AAA 2/1/1999 333(10), 222(10) directed to read the full version instead of this one. 0102 BBB 2/8/1999 111(10) 0379 CCC 4/9/1999 222(5), 333(5) In particular, this version omits details of several al- 0524 DDD 6/9/1999 111(20), 333(20) gorithms, discussion of nondeterministic transforma- 0761 EEE 8/21/1999 111(10) tions, indexing techniques, lineage tracing for trans- 0952 CCC 11/8/1999 111(5) formations with multiple input and output sets, trac- 1028 DDD 11/24/1999 222(10) ing through arbitrary transformation graphs, perfor- 1250 BBB 12/15/1999 222(10), 333(10) mance experiments, avenues of future work, some ex- Figure 2: Source data set for Order amples, and proofs for all theorems. erating lineage tracing procedures automatically for vari- 1.1 Related Work ous classes of relational and multidimensional views, but none of these approaches can handle warehouse data cre- There has been a significant body of work on data ated through general transformations. In [FJS97], a sta- transformations in general, including aspects such as tistical approach is used for reconstructing base (lineage) transforming data formats, models, and schemas, e.g., + + + data from summary data in the presence of certain con- [ACM 99, BDH 95, CR99, HMN 99, LSS96, RH00, straints. However, the approach provides only estimated Shu87, Squ95]. Often the focus is on data integration or lineage information and does not ensure accuracy. Fi- warehousing, but none of these papers considers lineage nally, [LGMW00] considers an ETL setting like ours, and tracing through transformations, or even addresses the re- defines the concept of a contributor in order to enable lated problem of transformation inverses. efficient resumption of interrupted warehouse loads. Al- Most previous work on data lineage focuses on coarse- though similar in overall spirit, the definition of a contrib- grained (or schema-level) lineage tracing, and uses an- utor is different from our definition of data lineage, and notations to provide lineage information such as which does not capture all aspects of lineage we consider in this transformations were involved in producing a given ware- paper. In addition, we consider a more general class of house data item [BB99, LBM98], or which source at- transformations than those considered in [LGMW00]. tributes derive certain warehouse attributes [HQGW93, RS98]. By contrast, we consider fine-grained (or instance-level) lineage tracing: we retrieve the actual 1.2 Running Example set of source data items that derived a given warehouse We present a small running example, designed to il- data item. As will be seen, in some cases we can lustrate problems and techniques throughout the paper. use coarse-grained lineage information (schema map- Consider a data warehouse with retail store data derived pings) in support of our fine-grained lineage tracing tech- from two source tables: niques. In [Cui01], we extend the work in this paper with an annotation-based technique for instance-level lineage Product(prod-id; prod-name; category; price; valid) tracing, similar in spirit to the schema-level annotation Order(order-id; cust-id; date; prod-list) techniques in [BB99].

Lineage Tracing for General Data Warehouse Transformations

Data Warehouse Fundamentals for Storage Professionals – What You Need to Know EMC Proven Professional Knowledge Sharing 2011

Informal Data Transformation Considered Harmful

Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation Jie Song George Alter H

GAVS' Blockchain-Based

What Is a Data Warehouse?

POLITECNICO DI TORINO Repository ISTITUZIONALE

Data Warehousing on AWS

Master Data Management Whitepaper.Indd

Data Governance with Oracle

Data Management Capability

Data Warehouse Applications in Modern Day Business

Lecture Notes on Data Mining& Data Warehousing