
Lineage Tracing for General Warehouse Transformations∗

Yingwei Cui and Jennifer Widom
Computer Science Department, Stanford University
{cyw, widom}@db.stanford.edu

Abstract. Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex "data cleansing" procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.

∗ This work was supported by the National Science Foundation under grants IIS-9811947 and IIS-9817799.

1 Introduction

Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information [CD97, LW95]. Sometimes during data analysis it is useful to look not only at the information in the warehouse, but also to investigate how certain warehouse information was derived from the sources. Tracing warehouse data items back to the source data items from which they were derived is termed the data lineage problem [CWW00]. Enabling lineage tracing in a data warehousing environment has several benefits and applications, including in-depth data analysis and data mining, authorization management, view update, efficient warehouse recovery, and others as outlined in, e.g., [BB99, CWW00, HQGW93, LBM98, LGMW00, RS98, RS99, WS97].

In previous work [CW00, CWW00], we studied the warehouse data lineage problem in depth, but we only considered warehouse data defined as relational materialized views over the sources, i.e., views specified using SQL or relational algebra. Related work has focused on even simpler relational views [Sto75] or on multidimensional views [DB2, Pow]. In real production data warehouses, however, data imported from the sources is generally "cleansed", integrated, and summarized through a sequence or graph of transformations, and many commercial warehousing systems provide tools for creating and managing such transformations as part of the extract-transform-load (ETL) process, e.g., [Inf, Mic, PPD, Sag]. The transformations may vary from simple algebraic operations or aggregations to complex procedural code.

In this paper we consider the problem of lineage tracing for data warehouses created by general transformations. Since we no longer have the luxury of a fixed set of operators or the algebraic properties offered by relational views, the problem is considerably more difficult and open-ended than previous work on lineage tracing. Furthermore, since transformation graphs in real ETL processes can often be quite complex—containing as many as 60 or more transformations—the storage requirements and runtime overhead associated with lineage tracing are very important considerations.

We develop an approach to lineage tracing for general transformations that takes advantage of known structure or properties of transformations when present, yet provides tracing facilities in the absence of such information as well. Our tracing algorithms apply to single transformations, to linear sequences of transformations, and to arbitrary acyclic transformation graphs. We present optimizations that effectively reduce the storage and runtime overhead in the case of large transformation graphs. Our results can be used as the basis for an in-depth data warehouse analysis and debugging tool, by which analysts can browse their warehouse data, then trace back to the source data that produced warehouse data items of interest. Our results also can guide the design of data warehouses that enable efficient lineage tracing.

The main contributions of our work are:

• In Sections 2 and 3 we define data transformations formally and identify a set of relevant transformation properties. We define data lineage for general warehouse transformations exhibiting these properties, but we also cover "black box" transformations with no known properties. The transformation properties we consider can be specified easily by transformation authors, and they encompass a large majority of transformations used for real data warehouses.

• In Section 3 we develop lineage tracing algorithms for single transformations. Our algorithms take advantage of transformation properties when they are present, and we also suggest how indexes can be used to further improve tracing performance.

• In Section 4 and the full version of this paper [CW01] we develop a general algorithm for lineage tracing through a sequence or graph of transformations. Our algorithm includes methods for combining transformations so that we can reduce overall tracing cost, including the number of transformations we must trace through and the number of intermediate results that must be stored or recomputed for the purpose of lineage tracing.

• We have implemented a prototype lineage tracing system based on our algorithms, and in the full version of this paper [CW01] we present a few initial performance results.

For examples in this paper we use the relational data model, but our approach and results clearly apply to data objects in general.

This version of the paper is considerably reduced from the original full version [CW01]. Readers interested in our complete treatment of the topic are directed to read the full version instead of this one. In particular, this version omits details of several algorithms, discussion of nondeterministic transformations, indexing techniques, lineage tracing for transformations with multiple input and output sets, tracing through arbitrary transformation graphs, performance experiments, avenues of future work, some examples, and proofs for all theorems.

1.1 Related Work

There has been a significant body of work on data transformations in general, including aspects such as transforming data formats, models, and schemas, e.g., [ACM+99, BDH+95, CR99, HMN+99, LSS96, RH00, Shu87, Squ95]. Often the focus is on data integration or warehousing, but none of these papers considers lineage tracing through transformations, or even addresses the related problem of transformation inverses.

Most previous work on data lineage focuses on coarse-grained (or schema-level) lineage tracing, and uses annotations to provide lineage information such as which transformations were involved in producing a given warehouse data item [BB99, LBM98], or which source attributes derive certain warehouse attributes [HQGW93, RS98]. By contrast, we consider fine-grained (or instance-level) lineage tracing: we retrieve the actual set of source data items that derived a given warehouse data item. As will be seen, in some cases we can use coarse-grained lineage information (schema mappings) in support of our fine-grained lineage tracing techniques. In [Cui01], we extend the work in this paper with an annotation-based technique for instance-level lineage tracing, similar in spirit to the schema-level annotation techniques in [BB99]. It is worth noting that although an annotation-based approach can improve lineage tracing performance, it is likely to slow down warehouse loading and refresh, so an annotation-based approach may only be desirable for lineage-intensive applications.

In [WS97], a general framework is proposed for computing fine-grained data lineage in a transformational setting. The paper defines and traces data lineage for each transformation based on a weak inverse, which must be specified by the transformation definer. Lineage tracing through a transformation graph proceeds by tracing through each path one transformation at a time. In our approach, the definition and tracing of data lineage is based on general transformation properties, and we specify an algorithm for combining transformations in a sequence or graph for improved tracing performance. In [CWW00, DB2, Sto75], algorithms are provided for generating lineage tracing procedures automatically for various classes of relational and multidimensional views, but none of these approaches can handle warehouse data created through general transformations. In [FJS97], a statistical approach is used for reconstructing base (lineage) data from summary data in the presence of certain constraints. However, the approach provides only estimated lineage information and does not ensure accuracy. Finally, [LGMW00] considers an ETL setting like ours, and defines the concept of a contributor in order to enable efficient resumption of interrupted warehouse loads. Although similar in overall spirit, the definition of a contributor is different from our definition of data lineage, and does not capture all aspects of lineage we consider in this paper. In addition, we consider a more general class of transformations than those considered in [LGMW00].

1.2 Running Example

We present a small running example, designed to illustrate problems and techniques throughout the paper. Consider a data warehouse with retail store data derived from two source tables:

Product(prod-id, prod-name, category, price, valid)
Order(order-id, cust-id, date, prod-list)

The Product table is mostly self-explanatory. Attribute valid specifies the time period during which a price is effective.¹ The Order table also is mostly self-explanatory. Attribute prod-list specifies the list of ordered products with product ID and (parenthesized) quantity for each. Sample contents of small source tables are shown in Figures 1 and 2.

¹ We assume that valid is a simple string, which unfortunately is a typical ad-hoc treatment of time.

  prod-id  prod-name   category     price  valid
  111      Apple IMAC  computer     1200   10/1/1998–
  222      Sony VAIO   computer     3280   9/1/1998–11/30/1998
  222      Sony VAIO   computer     2250   12/1/1998–9/30/1999
  222      Sony VAIO   computer     1980   10/1/1999–
  333      Canon A5    electronics  400    4/2/1999–
  444      Sony VAIO   computer     2750   12/1/1998–

Figure 1: Source data set for Product

  order-id  cust-id  date        prod-list
  0101      AAA      2/1/1999    333(10), 222(10)
  0102      BBB      2/8/1999    111(10)
  0379      CCC      4/9/1999    222(5), 333(5)
  0524      DDD      6/9/1999    111(20), 333(20)
  0761      EEE      8/21/1999   111(10)
  0952      CCC      11/8/1999   111(5)
  1028      DDD      11/24/1999  222(10)
  1250      BBB      12/15/1999  222(10), 333(10)

Figure 2: Source data set for Order

Suppose an analyst wants to build a warehouse table listing computer products that had a significant sales jump in the last quarter: the last quarter sales were more than twice the average sales for the preceding three quarters. A table SalesJump is defined in the data warehouse for this purpose.

  Order:
  order-id  cust-id  date        prod-list
  0101      AAA      2/1/1999    333(10), 222(10)
  0379      CCC      4/9/1999    222(5), 333(5)
  1028      DDD      11/24/1999  222(10)
  1250      BBB      12/15/1999  222(10), 333(10)

  Product:
  prod-id  prod-name  category  price  valid
  222      Sony VAIO  computer  2250   12/1/1998–9/30/1999
  222      Sony VAIO  computer  1980   10/1/1999–

Figure 3: Lineage of ⟨Sony VAIO, 11250, 39600⟩

Figure 4 shows how the contents of table SalesJump can be specified using a transformation graph G with inputs Order and Product.

G is a directed acyclic graph composed of the following seven transformations:

• T1 splits each input order according to its product list into multiple orders, each with a single ordered product and quantity. The output has schema ⟨order-id, cust-id, date, prod-id, quantity⟩.

• T2 filters out products not in the computer category.

• T3 effectively performs a relational join on the outputs from T1 and T2, with T1.prod-id = T2.prod-id and T1.date occurring in the period of T2.valid. T3 also drops attributes cust-id and category, so the output has schema ⟨order-id, date, prod-id, quantity, prod-name, price, valid⟩.

• T4 computes the quarterly sales for each product. It groups the output from T3 by prod-name, computes the total sales for each product for the four previous quarters, and pivots the results to output a table with schema ⟨prod-name, q1, q2, q3, q4⟩, where q1–q4 are the quarterly sales.

• T5 computes from the output of T4 the average sales of each product in the first three quarters. The output schema is ⟨prod-name, q1, q2, q3, avg3, q4⟩, where avg3 is the average sales (q1 + q2 + q3)/3.

• T6 selects those products whose last quarter's sales were greater than twice the average of the preceding three quarters.

• T7 performs a final projection to output SalesJump with schema ⟨prod-name, avg3, q4⟩.

  Order → T1 and Product → T2 feed T3, followed by T4 → T5 → T6 → T7 → SalesJump

Figure 4: Transformations to derive SalesJump

Note that some of these transformations (T2, T5, T6, and T7) could be expressed as standard relational operations, while others (T1, T3, and T4) could not. As a simple lineage example, for the data in Figures 1 and 2 the warehouse table SalesJump contains tuple t = ⟨Sony VAIO, 11250, 39600⟩, indicating that the sales of VAIO computers jumped from an average of 11250 in the first three quarters to 39600 in the last quarter. An analyst may want to see the relevant detailed information by tracing the lineage of tuple t, that is, by inspecting the original input data items that produced t. Using the techniques to be developed in this paper, from the source data in Figures 1 and 2 the analyst will be presented with the lineage result in Figure 3.

2 Transformations and Data Lineage

In this section, we formalize general data transformations and data lineage, then we briefly motivate why transformation properties can help us with lineage tracing.

2.1 Transformations

Let a data set be any set of data items—tuples, values, complex objects—with no duplicates in the set. (The effect duplicates have on lineage tracing has been addressed in some detail in [CWW00].) A transformation T is any procedure that takes data sets as input and produces data sets as output. Here we will consider only transformations that take a single data set as input and produce a single output set. We extend our results to transformations with multiple input sets and output sets in [CW01]. For any input data set I, we say that the application of T to I resulting in an output set O, denoted T(I) = O, is an instance of T.

Given transformations T1 and T2, their composition T = T1 ◦ T2 is the transformation that first applies T1 to I to obtain I′, then applies T2 to I′ to obtain O. T1 and T2 are called T's component transformations. The composition operation is associative: (T1 ◦ T2) ◦ T3 = T1 ◦ (T2 ◦ T3). Thus, given transformations T1, T2, ..., Tn, we represent the composition ((T1 ◦ T2) ◦ ...) ◦ Tn as a transformation sequence T1 ◦ · · · ◦ Tn. A transformation that is not defined as a composition of other transformations is atomic.

For now we will assume that all of our transformations are stable and deterministic. A transformation T is stable if it never produces spurious output items, i.e., T(∅) = ∅. A transformation is deterministic if it always produces the same output set given the same input set. All of the example transformations we have seen are stable and deterministic. An example of an unstable transformation is one that appends a fixed data item or set of items to every output set, regardless of the input. An example of a nondeterministic transformation is one that transforms a random sample of the input set. In practice we usually require transformations to be stable but often do not require them to be deterministic. See [CW01] for a discussion of when the deterministic assumption can be dropped.

2.2 Data Lineage

In the general case a transformation may inspect the entire input data set to produce each item in the output data set, but in most cases there is a much more fine-grained relationship between the input and output data items: a data item o in the output set may have been derived from a small subset of the input data items (maybe only one), as opposed to the entire input data set. Given a transformation instance T(I) = O and an output item o ∈ O, we call the actual set I* ⊆ I of input data items that contributed to o's derivation the lineage of o, and we denote it as I* = T*(o, I). The lineage of a set of output data items O* ⊆ O is the union of the lineage of each item in the set: T*(O*, I) = ∪_{o∈O*} T*(o, I). A detailed definition of data lineage for different types of transformations will be given in Section 3.

  I:          T(I) = O:
  X   Y       X   Y
  a  −1       a   2
  a   2       b   0
  b   0

Figure 5: A transformation instance

Knowing something about the workings of a transformation is important for tracing data lineage—if we know nothing, any input data item may have participated in the derivation of an output item. Let us consider an example. Given a transformation T and its instance T(I) = O in Figure 5, the lineage of the output item ⟨a, 2⟩ depends on T's definition, as we will illustrate. Suppose T is a transformation that filters out input items with a negative Y value (i.e., T = σ_{Y≥0} in relational algebra). Then the lineage of output item o = ⟨a, 2⟩ should include only input item ⟨a, 2⟩. Now, suppose instead that T groups the input data items based on their X values and computes the sum of their Y values multiplied by 2 (i.e., T = γ_{X, 2·sum(Y) as Y} in relational algebra, where γ performs grouping and aggregation). Then the lineage of output item o = ⟨a, 2⟩ should include input items ⟨a, −1⟩ and ⟨a, 2⟩, because o is computed from both of them. We will refer back to these two transformations later (along with our earlier examples from Section 1.2), so let us call the first one T8 and the second one T9.
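To make the contrast concrete, here is a small Python sketch (ours, not code from the paper) that replays the Figure 5 instance. The functions t8 and t9 are stand-in implementations of T8 and T9, and the two lineage sets are computed directly from the informal definitions above.

    # A minimal sketch of the Figure 5 instance under the two readings of T.
    I = {("a", -1), ("a", 2), ("b", 0)}

    def t8(items):                      # T8: filter out items with negative Y
        return {(x, y) for (x, y) in items if y >= 0}

    def t9(items):                      # T9: group by X, output (X, 2 * sum(Y))
        sums = {}
        for x, y in items:
            sums[x] = sums.get(x, 0) + y
        return {(x, 2 * s) for x, s in sums.items()}

    assert t9(I) == {("a", 2), ("b", 0)}        # both readings yield the same O

    # Lineage of o = ("a", 2) according to each definition:
    o = ("a", 2)
    lineage_t8 = {i for i in I if o in t8({i})}          # only ("a", 2) itself
    lineage_t9 = {(x, y) for (x, y) in I if x == o[0]}   # the whole group for X = "a"
    print(lineage_t8)   # {('a', 2)}
    print(lineage_t9)   # {('a', -1), ('a', 2)}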
Given a transformation specified as a standard relational operator or view, we can define and retrieve the exact data lineage for any output data item using the techniques introduced in [CWW00]. On the other hand, if we know nothing at all about a transformation, then the lineage of an output item must be defined as the entire input set. In reality transformations often lie between these two extremes—they are not standard relational operators, but they have some known structure or properties that can help us identify and trace data lineage.

The transformation properties we will consider often can be specified easily by the transformation author, or they can be inferred from the transformation definition (as relational operators, for example), or possibly even "learned" from the transformation's behavior. In this paper, we do not focus on how properties are specified or discovered, but rather on how they are exploited for lineage tracing.

3 Tracing Using Transformation Properties

We consider three overall kinds of properties and provide algorithms that trace data lineage using these properties. First, each transformation is in a certain transformation class based on how it maps input data items to output items (Section 3.1). Second, we may have one or more schema mappings for a transformation, specifying how certain output attributes relate to input attributes (Section 3.2). Third, a transformation may be accompanied by a tracing procedure or inverse transformation, which is the best case for lineage tracing (Section 3.3). When a transformation exhibits many properties, we determine the best one for lineage tracing based on a property hierarchy (Section 3.4).

3.1 Transformation Classes

In this section, we define three transformation classes: dispatchers, aggregators, and black-boxes. For each class, we give a formal definition of data lineage and specify a lineage tracing procedure. We also consider several subclasses for which we specify more efficient tracing procedures. Our informal studies have shown that about 95% of the transformations used in real data warehouses are dispatchers, aggregators, or their compositions (covered in Section 4), and a large majority fall into the more efficient subclasses.

3.1.1 Dispatchers

A transformation T is a dispatcher if each input data item produces zero or more output data items independently: ∀I, T(I) = ∪_{i∈I} T({i}). Figure 6(a) illustrates a dispatcher, in which input item 1 produces output items 1–4, input item 3 produces output items 3–6, and input item 2 produces no output items. The lineage of an output item o according to a dispatcher T is defined as T*(o, I) = {i ∈ I | o ∈ T({i})}.

  (a) dispatcher   (b) aggregator   (c) black-box

Figure 6: Transformation classes

A simple procedure TraceDS(T, O*, I) can be used to trace the lineage of a set of output items O* ⊆ O according to a dispatcher T. The procedure applies T to the input data items one at a time and returns those items that produce one or more items in O*.² Note that all of our tracing procedures are specified to take a set of output items as a parameter instead of a single output item, for generality and also for tracing lineage through transformation sequences and graphs.

² For now we are assuming that the input set is readily available. Cases where the input set is unavailable or unnecessary will be considered later.

Example 3.1 (Lineage Tracing for Dispatchers) Transformation T1 in Section 1.2 is a dispatcher, because each input order produces one or more output orders via T1. Given an output item o = ⟨0101, AAA, 2/1/1999, 222, 10⟩ based on the sample data of Figure 2, we can trace o's lineage according to T1 using procedure TraceDS(T1, {o}, Order) to obtain T1*(o, Order) = {⟨0101, AAA, 2/1/1999, "333(10), 222(10)"⟩}. Transformations T2, T5, T6, and T7 in Section 1.2 and T8 in Section 2.2 all are dispatchers, and we can similarly trace data lineage for them.

TraceDS requires a complete scan of the input data set, and for each input item i it calls transformation T over {i}, which can be very expensive if T has significant overhead (e.g., startup time). In [CW01] we discuss how indexes can be used to improve the performance of TraceDS. Next we introduce a common subclass of dispatchers, filters, for which lineage tracing is trivial.

Filters. A dispatcher T is a filter if each input item produces either itself or nothing: ∀i ∈ I, T({i}) = {i} or T({i}) = ∅. Thus, the lineage of any output data item is the same item in the input set: ∀o ∈ O, T*(o) = {o}. The tracing procedure for a filter T simply returns the traced item set O* as its own lineage. It does not need to call the transformation T or scan the input data set, which can be a significant advantage in many cases (see Section 4). Transformation T8 in Section 2.2 is a filter, and the lineage of output item o = ⟨a, 2⟩ is the same item ⟨a, 2⟩ in the input set. Other examples of filters are T2 and T6 in Section 1.2.
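The following sketch (ours, under the assumption that a dispatcher is available as a Python function from item sets to item sets) illustrates the behavior of TraceDS; split_order is a hypothetical stand-in for T1 of Section 1.2, and trace_filter shows the filter shortcut that needs neither the transformation nor the input set.

    # A minimal sketch of dispatcher tracing. split_order mimics T1: it splits an
    # order tuple (order-id, cust-id, date, prod-list) into one item per product.
    def split_order(items):
        out = set()
        for (oid, cust, date, prod_list) in items:
            for entry in prod_list.split(","):
                prod, qty = entry.strip().rstrip(")").split("(")
                out.add((oid, cust, date, prod, int(qty)))
        return out

    def trace_ds(transform, traced_output, input_set):
        # Apply the dispatcher to one input item at a time; keep the items
        # that produce at least one item of the traced output set.
        return {i for i in input_set if transform({i}) & traced_output}

    def trace_filter(traced_output):
        # For a filter, every traced output item is its own lineage.
        return set(traced_output)

    orders = {("0101", "AAA", "2/1/1999", "333(10), 222(10)"),
              ("0102", "BBB", "2/8/1999", "111(10)")}
    o = ("0101", "AAA", "2/1/1999", "222", 10)
    print(trace_ds(split_order, {o}, orders))
    # {('0101', 'AAA', '2/1/1999', '333(10), 222(10)')}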
3.1.2 Aggregators

A transformation T is an aggregator if T is complete (defined momentarily), and for all I and T(I) = O = {o1, ..., on}, there exists a unique disjoint partition I1, ..., In of I such that T(Ik) = {ok} for k = 1..n. I1, ..., In is called the input partition, and Ik is ok's lineage according to T: T*(ok, I) = Ik. A transformation T is complete if each input data item always contributes to some output data item: ∀I ≠ ∅, T(I) ≠ ∅. Figure 6(b) illustrates an aggregator, where the lineage of output item 1 is input items {1, 2}, the lineage of output item 2 is {3}, and the lineage of output item 3 is {4, 5, 6}.

Transformation T9 in Section 2.2 is an aggregator. The input partition is I1 = {⟨a, −1⟩, ⟨a, 2⟩}, I2 = {⟨b, 0⟩}, and the lineage of output item o = ⟨a, 2⟩ is I1. Among the transformations in Section 1.2, T4, T5, and T7 are aggregators. Note that transformations can be both aggregators and dispatchers (e.g., T5 and T7 in Section 1.2). We will address how overlapping properties affect lineage tracing in Section 3.4.

To trace the lineage of an output subset O* according to an aggregator T, we use procedure TraceAG(T, O*, I), which enumerates subsets of input I. It returns the unique subset I* such that I* produces exactly O*, i.e., T(I*) = O*, and the rest of the input set produces the rest of the output set, i.e., T(I − I*) = O − O*. During the enumeration, we examine the subsets in increasing size. If we find a subset I′ such that T(I′) = O* but T(I − I′) ≠ O − O*, we then need to examine only supersets of I′, which can reduce the work significantly.

TraceAG may call T as many as 2^|I| times in the worst case, which can be prohibitive. We introduce two common subclasses of aggregators, context-free aggregators and key-preserving aggregators, which allow us to apply much more efficient tracing procedures.

Context-Free Aggregators. An aggregator is context-free if any two input data items either always belong to the same input partition, or they always do not, regardless of the other items in the input set. In other words, a context-free aggregator determines the partition that an input item belongs to based on its own value, and not on the values of any other input items. All example aggregators we have seen are context-free. As an example of a non-context-free aggregator, consider a transformation T that clusters input data points based on their x-y coordinates and outputs some aggregate value of items in each cluster. Suppose T specifies that any two points within distance d from each other must belong to the same cluster. T is an aggregator, but it is not context-free, since whether two items belong to the same cluster or not may depend on the existence of a third item near to both.

We specify tracing procedure TraceCF(T, O*, I) in Figure 7 for context-free aggregators. This procedure first scans the input data set to create the partitions (which we could not do linearly if the aggregator were not context-free), then it checks each partition to find those that produce items in O*. TraceCF reduces the number of transformation calls to |I|² + |I| in the worst case, which is a significant improvement.

  procedure TraceCF(T, O*, I)
    I* ← ∅;
    pnum ← 0;
    for each i ∈ I do
      if pnum = 0 then I1 ← {i}; pnum ← 1; continue;
      for (k ← 1; k ≤ pnum; k++) do
        if |T(Ik ∪ {i})| = 1 then Ik ← Ik ∪ {i}; break;
      if k > pnum then pnum ← pnum + 1; Ipnum ← {i};
    for k ← 1..pnum do
      if T(Ik) ⊆ O* then I* ← I* ∪ Ik;
    return I*;

Figure 7: Tracing procedure for context-free aggregators

Key-Preserving Aggregators. Suppose each input item and output item contains a unique key value in the relational sense, denoted i.key for item i. An aggregator T is key-preserving if given any input set I and its input partition I1, ..., In for output T(I) = {o1, ..., on}, all subsets of Ik produce a single output item with the same key value as ok, for k = 1..n. That is, ∀I′ ⊆ Ik: T(I′) = {o′k} and o′k.key = ok.key.

Theorem 3.2 Key-preserving aggregators are context-free.

All example aggregators we have seen are key-preserving. As an example of a context-free but non-key-preserving aggregator, consider a relational groupby-aggregation that does not retain the grouping attribute.

To trace the lineage of O* according to a key-preserving aggregator T, we use procedure TraceKP(T, O*, I), which scans the input data set once and returns all input items that produce output items with the same key as items in O*. TraceKP reduces the number of transformation calls to |I|, with each call operating on a single input data item. We can further improve performance of TraceKP using an index, as discussed in [CW01].
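A sketch of the idea behind TraceKP (ours, not the paper's code) is shown below. The key extractor and the t5 stand-in for T5 of Section 1.2 are assumptions, and the quarterly sales values used in the example are the ones derivable from Figures 1 and 2.

    # A minimal sketch of key-preserving tracing: one transformation call per
    # input item, keeping items whose output key appears in the traced set.
    def trace_kp(transform, key, traced_output, input_set):
        traced_keys = {key(o) for o in traced_output}
        lineage = set()
        for i in input_set:                       # one call per input item
            if any(key(o) in traced_keys for o in transform({i})):
                lineage.add(i)
        return lineage

    # Stand-in for T5: add avg3 = (q1 + q2 + q3) / 3 to each ⟨prod-name, q1..q4⟩.
    def t5(items):
        return {(p, q1, q2, q3, (q1 + q2 + q3) / 3, q4)
                for (p, q1, q2, q3, q4) in items}

    # Quarterly sales computed from Figures 1 and 2:
    quarterly = {("Sony VAIO", 22500, 11250, 0, 39600),
                 ("Apple IMAC", 12000, 24000, 12000, 6000)}
    o = ("Sony VAIO", 22500, 11250, 0, 11250.0, 39600)
    print(trace_kp(t5, lambda item: item[0], {o}, quarterly))
    # {('Sony VAIO', 22500, 11250, 0, 39600)}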
3.1.3 Black-box Transformations

An atomic transformation is called a black-box transformation if it is neither a dispatcher nor an aggregator, and it does not have a provided lineage tracing procedure (Section 3.3). In general, any subset of the input items may have been used to produce a given output item through a black-box transformation, as illustrated in Figure 6(c), so all we can say is that the entire input data set is the lineage of each output item: ∀o ∈ O, T*(o, I) = I. Thus, the tracing procedure for a black-box transformation simply returns the entire input I.

As an example of a true black-box, consider a transformation T that sorts the input data items and attaches a serial number to each output item according to its sorted position. For instance, given input data set I = {⟨f, 10⟩, ⟨b, 20⟩, ⟨c, 5⟩} and sorting by the first attribute, the output is T(I) = {⟨1, b, 20⟩, ⟨2, c, 5⟩, ⟨3, f, 10⟩}, and the lineage of each output data item is the entire input set I. Note that in this case each output item, in particular its serial number, is indeed derived from all input data items.

3.2 Schema Mappings

Schema information can be very useful in the ETL process, and many data warehousing systems require transformation programmers to provide some schema information. In this section, we discuss how we can use schema information to improve lineage tracing for dispatchers and aggregators. Sometimes schema information also can improve lineage tracing for a black-box transformation T, specifically when T can be combined with another non-black-box transformation based on T's schema information (Section 4). A schema specification may include:

  input schema A = ⟨A1..Ap⟩ and input key Akey ⊆ A
  output schema B = ⟨B1..Bq⟩ and output key Bkey ⊆ B

The specification also may include schema mappings, defined as follows.

Definition 3.3 (Schema Mappings) Consider a transformation T with input schema A and output schema B. Let A ⊆ A and B ⊆ B be lists of input and output attributes. Let i.A denote the A attribute values of i, and similarly for o.B. Let f and g be functions from tuples of attribute values to tuples of attribute values. We say that T has a forward schema mapping f(A) →T B if we can partition any input set I into I1, ..., Im based on equality of f(A) values,³ and partition the output set O = T(I) into O1, ..., On based on equality of B values, such that m ≥ n and:

1. for k = 1..n, T(Ik) = Ok and Ik = {i ∈ I | f(i.A) = o.B for any o ∈ Ok}.

2. for k = (n+1)..m, T(Ik) = ∅.

Similarly, we say that T has a backward schema mapping A ←T g(B) if we can partition any input set I into I1, ..., Im based on equality of A values, and partition the output set O = T(I) into O1, ..., On based on equality of g(B) values, such that m ≥ n and:

1. for k = 1..n, T(Ik) = Ok and Ik = {i ∈ I | i.A = g(o.B) for any o ∈ Ok}.

2. for k = (n+1)..m, T(Ik) = ∅.

When f (or g) is the identity function, we simply write A →T B (or A ←T B). If A →T B and A ←T B we write A ↔T B.

³ That is, two input items i1 ∈ I and i2 ∈ I are in the same partition Ik iff f(i1.A) = f(i2.A).

Although Definition 3.3 may seem cumbersome, it formally and accurately captures the intuitive notion of schema mappings (certain input attributes producing certain output attributes) that transformations often exhibit.

Example 3.4 Schema information for transformation T5 in Section 1.2 can be specified as:

  Input schema and key: A = ⟨prod-name, q1, q2, q3, q4⟩, Akey = ⟨prod-name⟩
  Output schema and key: B = ⟨prod-name, q1..q4, avg3⟩, Bkey = ⟨prod-name⟩
  Schema mappings:
    ⟨prod-name, q1..q4⟩ ↔T5 ⟨prod-name, q1..q4⟩
    f(⟨q1..q3⟩) →T5 ⟨avg3⟩, where f(⟨a, b, c⟩) = (a+b+c)/3

Theorem 3.5 Consider a transformation T that is a dispatcher or an aggregator, and consider any instance T(I) = O. Given any output item o ∈ O, let I* be o's lineage according to the lineage definition for T's transformation class in Section 3.1. If T has a forward schema mapping f(A) →T B, then I* ⊆ {i ∈ I | f(i.A) = o.B}. If T has a backward schema mapping A ←T g(B), then I* ⊆ {i ∈ I | i.A = g(o.B)}.

Based on Theorem 3.5, when tracing lineage for a dispatcher or aggregator, we can narrow down the lineage of any output data item to a (possibly very small) subset of the input data set based on a schema mapping. We can then retrieve the exact lineage within that subset using the algorithms in Section 3.1. For example, consider an aggregator T with a backward schema mapping A ←T g(B). When tracing the lineage of an output item o ∈ O according to T, we can first find the input subset I′ = {i ∈ I | i.A = g(o.B)}, then enumerate subsets of I′ using TraceAG(T, o, I′) to find o's lineage I* ⊆ I′. If we have multiple schema mappings for T, we can use their intersection for improved tracing efficiency.

Although the narrowing technique of the previous paragraph is effective, when schema mappings satisfy certain additional conditions, we obtain transformation properties that permit very efficient tracing procedures.

Definition 3.6 (Schema Mapping Properties) Consider a transformation T with input schema A, input key Akey, output schema B, and output key Bkey.

1. T is a forward key-map (fkmap) if it is complete (∀I ≠ ∅, T(I) ≠ ∅) and it has a forward schema mapping to the output key: f(A) →T Bkey.

2. T is a backward key-map (bkmap) if it has a backward schema mapping to the input key: Akey ←T g(B).

3. T is a backward total-map (btmap) if it has a backward schema mapping to all input attributes: A ←T g(B).

Suppose that schema information and mappings are given for all transformations in Section 1.2. Then all of the transformations except T4 are backward key-maps; T2, T5, and T6 are backward total-maps; T4, T5, and T7 are forward key-maps.

Theorem 3.7 (1) All filters are backward total-maps. (2) All backward total-maps are backward key-maps. (3) All backward key-maps are dispatchers. (4) All forward key-maps are key-preserving aggregators.

Theorem 3.8 Consider a transformation instance T(I) = O. Given an output item o ∈ O, let I* be o's lineage based on T's transformation class as defined in Section 3.1.

1. If T is a forward key-map with schema mapping f(A) →T Bkey, then I* = {i ∈ I | f(i.A) = o.Bkey}.

2. If T is a backward key-map with schema mapping Akey ←T g(B), then I* = {i ∈ I | i.Akey = g(o.B)}.

3. If T is a backward total-map with schema mapping A ←T g(B), then I* = {g(o.B)}.

According to Theorem 3.8, we can use simple tracing procedures for transformations with the schema mapping properties specified in Definition 3.6. Procedure TraceFM(T, O*, I) performs lineage tracing for a forward key-map T, which by Theorem 3.7 also could be traced using procedure TraceKP from Section 3.1.2. Both algorithms scan each input item once; however, TraceKP applies transformation T to each item, while TraceFM applies function f to some attributes of each item. Certainly f is very unlikely to be more expensive than T, since T effectively computes f and may do other work as well; f may in fact be quite a bit cheaper. TraceBM(T, O*, I) uses a similar approach for a backward key-map, and is usually more efficient than TraceDS(T, O*, I) from Section 3.1.1 for the same reasons. TraceTM(T, O*) performs lineage tracing for a backward total-map, which is very efficient since it does not need to scan the input data set and makes no transformation calls. In [CW01], we discuss how indexes can be used to further speed up procedures TraceFM and TraceBM.

3.3 Provided Tracing Procedure or Inverse

If we are very lucky, a lineage tracing procedure may be provided along with the specification of a transformation T. The tracing procedure TP may require access to the input data set, i.e., TP(O*, I) returns O*'s lineage according to T, or the tracing procedure may not require access to the input, i.e., TP(O*) returns O*'s lineage. A related but not identical situation is when we are provided with the inverse for a transformation T. Sometimes, but not always, T's inverse can be used as T's tracing procedure.

Definition 3.9 (Inverse Transformation) A transformation T is invertible if there exists a transformation T⁻¹ such that ∀I, T⁻¹(T(I)) = I, and ∀O, T(T⁻¹(O)) = O. T⁻¹ is called T's inverse.

Theorem 3.10 If a transformation T is an aggregator with inverse T⁻¹, then for all instances T(I) = O and all o ∈ O, o's lineage according to T is T⁻¹({o}).

According to Theorem 3.10, we can use a transformation's inverse for lineage tracing if the invertible transformation is an aggregator. However, if the invertible transformation is a dispatcher or black-box, we cannot always use its inverse for lineage tracing, as illustrated by an example in [CW01].

Although we can guarantee very little about the accuracy or efficiency of provided tracing procedures or transformation inverses in the general case, it is our experience that, when provided, they are usually the most effective way to perform lineage tracing. We will make this assumption in the remainder of the paper.

3.4 Property Summary and Hierarchy

Figure 8 summarizes the transformation properties covered in the previous three sections. The table specifies which tracing procedure is applicable for each property, along with the number of transformation calls and number of input data item accesses for each procedure. We omit transformation inverses from the table, since when applicable they are equivalent to a provided tracing procedure not requiring input.

  Property                 Tracing Procedure  # Transformation Calls  # Input Accesses
  dispatcher               TraceDS            |I|                     |I|
  filter                   return O*          0                       0
  aggregator               TraceAG            O(2^|I|)                O(2^|I|)
  context-free aggr.       TraceCF            O(|I|^2)                O(|I|^2)
  key-preserving aggr.     TraceKP            |I|                     |I|
  black-box                return I           0                       0
  forward key-map          TraceFM            0                       |I|
  backward key-map         TraceBM            0                       |I|
  backward total-map       TraceTM            0                       0
  tracing-proc. w/ input   TP                 ?                       ?
  tracing-proc. w/o input  TP                 ?                       0

Figure 8: Summary of transformation properties

As discussed earlier, a transformation may satisfy more than one property. Some properties are better than others: tracing procedures may be more efficient, they may return a more accurate lineage result, or they may not require access to input data. Figure 9 specifies a hierarchy for determining which property is best to use for a specific transformation. In the hierarchy, a solid arrow from property p1 to p2 means that p2 is more restrictive than p1, i.e., all transformations that satisfy property p2 also satisfy property p1. Further, according to Sections 3.1–3.3, whenever p2 is more restrictive than p1, the tracing procedure for p2 is no less efficient (and usually more efficient) by any measure: number of transformation calls, number of input accesses, and whether the input data is required at all. (Black-box transformations, which are the only type with less accurate lineage results, are placed in a separate branch of the hierarchy.) A dashed arrow from property p1 to p2 in the hierarchy means that even though p2 is not strictly more restrictive than p1, p2 does yield a tracing procedure that again is no less efficient (and usually more efficient) by any measure.⁴

⁴ In some cases the tracing efficiency difference represented by a solid or dashed arrow is significant, while in other cases it is less so. This issue is discussed further in Section 4.

  1 filter, 2 tracing proc. w/o input, 3 backward total-map, 4 tracing proc. w/ input,
  5 backward key-map, 6 forward key-map, 7 key-preserving aggr., 8 context-free aggr.,
  9 aggregator, 10 dispatcher, 11 black-box

Figure 9: Transformation property hierarchy

Let us make the reasonable assumption that a provided tracing procedure requiring input is more efficient than TraceBM, and that a tracing procedure not requiring input is more efficient than TraceTM. Then we can derive a total order of the properties as shown by the numbers in Figure 9: the lower the number, the better the property is for lineage tracing. Given a set of properties for a transformation T, we always use the best one, i.e., the one with the lowest number, to trace data lineage for T. Figure 10 lists the best property for example transformations T1–T7 from Section 1.2 and T8–T9 from Section 2.2, along with other properties satisfied by these transformations. Note that we list only the most restrictive property on each branch of the hierarchy.

       Best Property       Additional Properties
  T1   backward key-map
  T2   filter
  T3   see [CW01]
  T4   forward key-map
  T5   backward total-map  forward key-map
  T6   filter
  T7   backward key-map    forward key-map
  T8   filter
  T9   forward key-map

Figure 10: Properties of T1–T9

4 Lineage Tracing through a Sequence

Having specified how to trace lineage for a single transformation with one input set and one output set, we now consider lineage tracing for sequences of such transformations. Multiple input and output sets and arbitrary acyclic transformation graphs are covered in [CW01].

4.1 Data Lineage for a Transformation Sequence

Consider a simple sequence of two transformations, such as T1 ◦ T2 in Figure 11, composed from Figures 6(a) and 6(b). For an input data set I, let I2 = T1(I) and O = T2(I2). Given an output data item o ∈ O, if I2* ⊆ I2 is the lineage of o according to T2, and I* ⊆ I is the lineage of I2* according to T1, then I* is the lineage of o according to T1 ◦ T2. For example, in Figure 11 if o ∈ O is item 3, then I2* is items {4, 5, 6} in I2, and I* is items {1, 3} in I. This lineage definition generalizes to arbitrarily long transformation sequences using the associativity of composition.

Figure 11: T1 ◦ T2 (the dispatcher of Figure 6(a) composed with the aggregator of Figure 6(b))

Given a transformation sequence T1 ◦ · · · ◦ Tn, where each Ik is the intermediate result output from Tk−1 and input to Tk, a correct but brute-force approach is to store all intermediate results I2, ..., In (in addition to initial input I) at loading time, then trace lineage backward through one transformation at a time. This approach is inefficient both due to the large number of tracing procedure calls when iterating through all transformations in the sequence, and due to the high storage cost for all intermediate results. The longer the sequence, the less efficient the overall tracing process, and for realistic transformation sequences (in practice sometimes as many as 60 transformations) the cost can be prohibitive. Furthermore, if any transformation T in the sequence is a black-box, we will end up tracing the lineage of the entire input to T regardless of what transformations follow T in the sequence. Fortunately, it is often possible to relieve these problems by combining adjacent transformations in a sequence for the purpose of lineage tracing. Also, since we do not always need input sets for lineage tracing as discussed in Section 3, some intermediate results can be discarded. We will use the following overall strategy.

• When a transformation sequence S = T1 ◦ · · · ◦ Tn is defined, we first normalize the sequence, to be specified in Section 4.2, by combining transformations in S when it is beneficial to do so. We then determine which intermediate results need to be saved for lineage tracing, based on the best properties for the remaining transformations.

• When data is loaded through the transformation sequence, the necessary intermediate results are saved.

• We can then trace the lineage of any output data item o in the warehouse through the normalized transformation sequence using the iterative tracing procedure described at the beginning of this section.
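As an illustration of the iterative backward tracing described at the beginning of this section, here is a minimal sketch (ours, not the paper's code); trace_sequence, the per-transformation tracer functions, and the stand-ins for T8 and T9 are all assumptions.

    # Trace a set of warehouse items back through a sequence, one step at a time,
    # using stored intermediate results and a per-transformation tracer.
    def trace_sequence(tracers, intermediates, traced_output):
        # tracers[k] traces through the (k+1)-th transformation of the sequence;
        # intermediates[k] is that transformation's input set (intermediates[0] = I).
        lineage = traced_output
        for tracer, input_set in zip(reversed(tracers), reversed(intermediates)):
            lineage = tracer(lineage, input_set)   # lineage one more step back
        return lineage

    # Example: T8 (filter) followed by T9 (group by X, 2 * sum(Y)) from Section 2.2.
    t8 = lambda items: {(x, y) for (x, y) in items if y >= 0}
    t9_trace = lambda out, inp: {(x, y) for (x, y) in inp
                                 if any(x == ox for (ox, _) in out)}
    t8_trace = lambda out, inp: set(out)          # a filter is its own lineage
    I = {("a", -1), ("a", 2), ("b", 0)}
    I2 = t8(I)                                    # stored intermediate result
    print(trace_sequence([t8_trace, t9_trace], [I, I2], {("a", 4)}))
    # {('a', 2)}  -- the source item behind output ('a', 4) of T8 ◦ T9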
4.2 Transformation Sequence Normalization

As discussed in Section 4.1, we want to combine transformations in a sequence for the purpose of lineage tracing when it is beneficial to do so. Specifically, we can combine transformations Tk−1 and Tk by replacing the two transformations with the single transformation T′ = Tk−1 ◦ Tk, eliminating the intermediate result Ik and tracing through the combined transformation in one step.

To decide whether combining a pair of transformations is beneficial, and to use combined transformations for lineage tracing, as a first step we need to determine the properties of a combined transformation based on the properties of its component transformations. Function Combine(T1, T2), specified in [CW01], returns combined transformation T = T1 ◦ T2 and sets T's properties based on those of T1 and T2.

Theoretically we can combine any adjacent transformations in a sequence, in fact we can collapse the entire sequence into one large transformation, but combined transformations may have less desirable properties than their component transformations, leading to less efficient or less accurate lineage tracing. Thus, we want to combine transformations only if it is beneficial to do so. Given a transformation sequence, determining the best way to combine transformations in the sequence is a difficult combinatorial problem—solving it accurately, or even just determining accurately when it is beneficial to combine two transformations, would require a detailed cost model that takes into account transformation properties, the cost of applying a transformation, the cost of storing intermediate results, and an estimated workload (including, e.g., data size and tracing frequency). Developing such a cost model is beyond the scope of this paper.

Instead, we suggest a greedy algorithm Normalize shown in Figure 12. The algorithm repeatedly finds beneficial combinations of transformation pairs in the sequence, combines the "best" pair, and continues until no more beneficial combinations are found. In general, a combination should be considered beneficial only if it reduces the overall tracing cost while improving or retaining tracing accuracy. We determine whether it is beneficial to combine two transformations based solely on their properties using the following two heuristics. First, we do not combine transformations into black-boxes, unless we are certain that the combination will not degrade the accuracy of the lineage result, which can only be determined as a last step of the Normalize procedure. Second, we do not combine transformations if their composition is significantly worse for lineage tracing, i.e., it has much higher tracing cost or leads to a less accurate result.

We divide the properties in Figure 9 into five groups: group 1 contains properties 1–3, group 2 contains properties 4–8, group 3 contains property 9, group 4 contains property 10, and group 5 contains property 11. Within each group, the efficiency and accuracy of the tracing procedures are fairly similar, while they differ significantly across groups. The group of a transformation T, denoted group(T), is the group that T's best property belongs to. The lower the group number, the better T is for lineage tracing, and we consider it beneficial to combine two transformations T1 and T2 only if group(T1 ◦ T2) ≤ max(group(T1), group(T2)).⁵

⁵ Note that the presence of non-key-map schema mappings (Section 3.2) or indexes [CW01] for a transformation T does not improve T's tracing efficiency to the point of moving it to a different group. Thus, we do not take these factors into account in our decision process.

Based on the above approach, procedure BestCombo in Figure 12, called by Normalize, finds the best pair of adjacent transformations to combine in sequence S, and returns its index. The procedure returns 0 if no combination is beneficial. We consider a beneficial combination to be the best if the combination leaves the fewest "bad" transformations in the sequence, compared with other candidates. Formally, we associate with S a vector N[1..5], where N[j] is the number of transformations in S that belong to group j. (So Σ_{j=1..5} N[j] equals the length of S.) Given two sequences S1 and S2 with vectors N1 and N2 respectively, let k be the highest index in which N1[k] differs from N2[k]. We say N1 < N2 if N1[k] < N2[k], which implies that S1 has fewer "bad" transformations than S2. Then we say that the best combination is the one that leads to the lowest vector N for the resulting sequence.

After we finish combining transformations as described above, suppose the sequence still contains one or more black-box transformations. During lineage tracing, we will end up tracing the lineage of the entire input to the earliest (left-most) black-box T in S, regardless of what transformations follow T. Therefore, as a final step we combine T with all transformations that follow T to eliminate unnecessary tracing and storage costs.

Our Normalize procedure has complexity O(n²) for a transformation sequence of length n. Although we use a greedy algorithm and heuristics for estimating the benefit of combining transformations, our approach is quite effective in improving tracing performance for sequences, as shown in [CW01].

  procedure Normalize(S = T1 ◦ · · · ◦ Tn)
    while (k ← BestCombo(S)) ≠ 0 do
      replace Tk and Tk+1 with T ← Combine(Tk, Tk+1);
    if S contains black-box transformations then
      j ← lowest index of a black-box in S;
      replace Tj, ..., Tn with T ← Tj ◦ · · · ◦ Tn;

  procedure BestCombo(S = T1 ◦ · · · ◦ Tn)
    k ← 0; N[1..5] ← [0, 0, 0, 0, 0];
    for j = 1..n do g ← group(Tj); N[g] ← N[g] + 1;
    for j = 1..n − 1 do
      curN ← N;
      T ← Combine(Tj, Tj+1); g ← group(T);
      if g < 5 then   // T is not a black-box
        g1 ← group(Tj); g2 ← group(Tj+1);
        curN[g] ← curN[g] + 1;
        curN[g1] ← curN[g1] − 1;
        curN[g2] ← curN[g2] − 1;
        if curN < N then k ← j; N ← curN;
    return k;

Figure 12: Normalizing a transformation sequence

Example 4.1 Consider the sequence of transformations S = T4 ◦ T5 ◦ T6 ◦ T7 from Section 1.2. Figure 13 shows the sequence and the best property of each transformation. The initial vector of S is N = [2, 2, 0, 0, 0]. Using our greedy normalization algorithm, we first consider combining T4 ◦ T5 into T4′ with best property fkmap, combining T5 ◦ T6 into T5′ with best property btmap, or combining T6 ◦ T7 into T6′ with best property bkmap. It turns out that all these combinations reduce S's vector N to [1, 2, 0, 0, 0]. So let us combine T4 ◦ T5, obtaining T4′, T6, and T7. In the new sequence, combining T4′ ◦ T6 results in a black-box, which is disallowed, while combining T6 ◦ T7 results in a transformation T6′ with best property bkmap, which reduces N to [0, 2, 0, 0, 0]. Therefore, we choose to combine T6 ◦ T7, obtaining T4′ and T6′. Combining these two transformations would result in a black-box, so we stop at this point. The final normalized sequence is shown in Figure 14.

  T4 (fkmap) ◦ T5 (btmap) ◦ T6 (filter) ◦ T7 (bkmap)

Figure 13: Before normalization

  T4′ = T4 ◦ T5 (fkmap) ◦ T6′ = T6 ◦ T7 (bkmap)

Figure 14: After normalization

5 Conclusions and Reminder

We have developed a complete set of techniques for data warehouse lineage tracing when the warehouse data is loaded through a graph of general transformations. Some of our techniques are presented in this version of the paper, but interested readers are strongly encouraged to investigate the full version [CW01] for a complete treatment of the topic: http://dbpubs.stanford.edu/pub/2001-5.

References

[ACM+99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Simeon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3–8, March 1999.

[BB99] P. Bernstein and T. Bergstraesser. Meta-data support for data transformations using Microsoft Repository. IEEE Data Engineering Bulletin, 22(1):9–14, March 1999.

[BDH+95] P. Buneman, S.B. Davidson, K. Hart, G.C. Overton, and L. Wong. A data transformation system for biological data sources. In Proc. of the Twenty-first International Conference on Very Large Data Bases, pages 158–169, Zurich, Switzerland, September 1995.

[CD97] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, March 1997.

[CR99] K.T. Claypool and E.A. Rundensteiner. Flexible database transformations: The SERF approach. IEEE Data Engineering Bulletin, 22(1):19–24, March 1999.

[Cui01] Y. Cui. Lineage Tracing in Data Warehouses. Ph.D. thesis, Computer Science Department, Stanford University, 2001.

[CW00] Y. Cui and J. Widom. Practical lineage tracing in data warehouses. In Proc. of the Sixteenth International Conference on Data Engineering, pages 367–378, San Diego, California, February 2000.

[CW01] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. Technical report, Stanford University Database Group, January 2001. http://dbpubs.stanford.edu/pub/2001-5.

[CWW00] Y. Cui, J. Widom, and J.L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems, 25(2):179–227, June 2000.

[DB2] IBM: DB2 OLAP Server. http://www.software.ibm.com/data/db2/.

[FJS97] C. Faloutsos, H.V. Jagadish, and N.D. Sidiropoulos. Recovering information from summary data. In Proc. of the Twenty-Third International Conference on Very Large Data Bases, pages 36–45, Athens, Greece, August 1997.

[HMN+99] L.M. Haas, R.J. Miller, B. Niswonger, M.T. Roth, P.M. Schwarz, and E.L. Wimmers. Transforming heterogeneous data with database middleware: Beyond integration. IEEE Data Engineering Bulletin, 22(1):31–36, March 1999.

[HQGW93] N.I. Hachem, K. Qiu, M. Gennert, and M. Ward. Managing derived data in the Gaea scientific DBMS. In Proc. of the Nineteenth International Conference on Very Large Data Bases, pages 1–12, Dublin, Ireland, August 1993.

[Inf] Informix Formation Tool. http://www.informix.com/informix/products/integration/formation/formation.html.

[LBM98] T. Lee, S. Bressan, and S. Madnick. Source attribution for querying against semi-structured documents. In Proc. of the Workshop on Web Information and Data Management, pages 33–39, Washington, DC, November 1998.

[LGMW00] W.J. Labio, H. Garcia-Molina, and J.L. Wiener. Efficient resumption of interrupted warehouse loads. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 46–57, Dallas, Texas, May 2000.

[LSS96] L. Lakshmanan, F. Sadri, and I.N. Subramanian. SchemaSQL – a language for interoperability in relational multi-database systems. In Proc. of the Twenty-Second International Conference on Very Large Data Bases, pages 239–250, India, September 1996.

[LW95] D. Lomet and J. Widom, editors. Special Issue on Materialized Views and Data Warehousing, IEEE Data Engineering Bulletin, 18(2), June 1995.

[Mic] Microsoft SQL Server, Data Transformation Services. http://msdn.microsoft.com/library/psdk/sql/dts_ovrw.htm.

[Pow] Cognos: PowerPlay OLAP Data Analysis and Reporting Tool. http://www.cognos.com/powerplay/.

[PPD] PPD Informatics: TableTrans Data Transformation Software. http://www.belmont.com/tt.html.

[RH00] V. Raman and J. Hellerstein. Potter's Wheel: An interactive framework for data cleaning. Technical report, U.C. Berkeley, 2000. http://control.cs.berkeley.edu/abc.

[RS98] A. Rosenthal and E. Sciore. Propagating integrity information among interrelated databases. In Proc. of the Second Working Conference on Integrity and Internal Control in Information Systems, pages 5–18, Warrenton, Virginia, November 1998.

[RS99] A. Rosenthal and E. Sciore. First class views: A key to user-centered computing. SIGMOD Record, 28(3):29–36, March 1999.

[Sag] Sagent Technology. http://www.sagent.com/.

[Shu87] N.C. Shu. Automatic data transformation and restructuring. In Proc. of the Third International Conference on Data Engineering, pages 173–180, Los Angeles, California, February 1987.

[Squ95] C. Squire. Data extraction and transformation for the data warehouse. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 446–447, California, May 1995.

[Sto75] M. Stonebraker. Implementation of integrity constraints and views by query modification. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 65–78, California, May 1975.

[WS97] A. Woodruff and M. Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In Proc. of the Thirteenth International Conference on Data Engineering, pages 91–102, Birmingham, UK, April 1997.