International Journal of Data Warehousing & Mining, 5(3), 1-27, July-September 2009

A Survey of Extract–Transform–Load Technology

Panos Vassiliadis, University of Ioannina, Greece

Abstract

The software processes that facilitate the original loading and the periodic refreshment of the data warehouse contents are commonly known as Extraction-Transformation-Loading (ETL) processes. The intention of this survey is to present the research work in the field of ETL technology in a structured way. To this end, we organize the coverage of the field as follows: (a) first, we cover the conceptual and logical modeling of ETL processes, along with some design methods, (b) we visit each stage of the E-T-L triplet, and examine problems that fall within each of these stages, (c) we discuss problems that pertain to the entirety of an ETL process, and, (d) we review some research prototypes of academic origin.

Keywords: ETL, data warehouse refreshment

INTRODUCTION

A data warehouse typically collects data from several operational or external systems (also known as the sources of the data warehouse) in order to provide its end-users with access to integrated and manageable information. In practice, this task (also known as data warehouse population) has to overcome several inherent problems, which can be shortly summarized as follows. First, since the different sources structure information in completely different schemata, the need to transform the incoming source data to a common, "global" data warehouse schema that will eventually be used by end-user applications for querying is imperative. Second, the data coming from the operational sources suffer from quality problems, ranging from simple misspellings in textual attributes to value inconsistencies, database constraint violations and conflicting or missing information; consequently, this kind of "noise" must be removed from the data, so that end-users are provided with data that are as clean, complete and truthful as possible. Third, since the information is constantly updated in the production systems that populate the warehouse, it is necessary to refresh the data warehouse contents regularly, in order to provide the users with up-to-date information. All these problems require that the respective software processes are constructed by the data warehouse development team (either manually, or via specialized tools) and executed in appropriate time intervals for the correct and complete population of the data warehouse.


The software processes that facilitate the population of the data warehouse are commonly known as Extraction-Transformation-Loading (ETL) processes. ETL processes are responsible for (i) the extraction of the appropriate data from the sources, (ii) their transportation to a special-purpose area of the data warehouse where they will be processed, (iii) the transformation of the source data and the computation of new values (and, possibly, records) in order to obey the structure of the data warehouse relation to which they are targeted, (iv) the isolation and cleansing of problematic tuples, in order to guarantee that business rules and database constraints are respected, and (v) the loading of the cleansed, transformed data to the appropriate relation in the warehouse, along with the refreshment of its accompanying indexes and materialized views.

A naïve, exemplary ETL scenario implemented in MS SQL Server Integration Services is depicted in Figure 1. The scenario is organized in two parts. The first part, named Extraction task (Figure 1a), is responsible for the identification of the new and the updated rows in the source table LINEITEM. The idea is that we have an older snapshot for line items, stored in table LINEITEM, which is compared to the new snapshot coming from the sources of the warehouse (depicted as NEW_LINEITEM in Figure 1b). Depending on whether a row is (a) newly inserted, or, (b) an existing tuple that has been updated, it is routed to the table LINEITEM via the appropriate transformation (remember that insertions and updates cannot be uniformly handled by the DBMS). Once the table LINEITEM is populated, the second part of the scenario, named Transform & Load task (Figure 1a), is executed. This task is depicted in Figure 1c and its purpose is to populate with the update information several tables in the warehouse that act as materialized views. The scenario first computes the value for the attribute Profit for each tuple and then sends the transformed rows towards four "materialized" views that compute the following aggregated measures (keep in mind that ExtendedPrice refers to the money clients pay per line item, PartKey is the primary key for items and SuppKey is the primary key for suppliers):

• View A: aggregate profit and average discount grouped by PartKey and SuppKey
• View B: average profit and extended price grouped by PartKey and LineStatus
• View C: aggregate profit and extended price grouped by LineStatus and PartKey
• View D: aggregate profit and extended price grouped by LineStatus

As one can observe, an ETL process is the synthesis of individual tasks that perform extraction, transformation, cleaning or loading of data in an execution graph – also referred to as a workflow. Also, due to the nature of the design artifact and the user interface of ETL tools, an ETL process is accompanied by a plan that is to be executed. For these reasons, in the rest of this survey we will use the terms ETL process, ETL workflow and ETL scenario interchangeably.

The historical background for ETL processes goes all the way back to the birth of information processing software. Software for transforming and filtering information from one (structured, semi-structured, or even unstructured) file to another has been constructed since the early years of data banks, when the relational model and declarative database querying had not yet been invented. Data and software were considered an inseparable duo by that time and, thus, this software was not treated as a stand-alone, special-purpose module of the information system's architecture. As Vassiliadis and Simitsis (2009) mention, "since then, any kind of data processing software that reshapes or filters records, calculates new values, and populates another data store than the original one is a form of an ETL program."

After the relational model had been born and the declarative nature of relational database querying had started to gain ground, it was quite natural that the research community would try to apply the declarative paradigm to data transformations.
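To make the flow of the scenario of Figure 1 concrete, the following Python sketch mimics its two parts over in-memory rows: it splits the incoming LINEITEM snapshot into insertions and updates, and then computes Profit and one of the four aggregate views (View A). The key attributes, the Profit formula and all function names are illustrative assumptions of this sketch, not part of the actual SSIS package.

    from collections import defaultdict

    def split_new_and_updated(old_rows, new_rows, key=("OrderKey", "LineNumber")):
        # route each incoming row to "inserts" or "updates" by comparing the two snapshots
        old_index = {tuple(r[c] for c in key): r for r in old_rows}
        inserts, updates = [], []
        for row in new_rows:
            k = tuple(row[c] for c in key)
            if k not in old_index:
                inserts.append(row)
            elif row != old_index[k]:
                updates.append(row)
        return inserts, updates

    def transform_and_load(rows):
        # compute Profit per tuple and populate View A: aggregate profit and
        # average discount grouped by (PartKey, SuppKey)
        view_a = defaultdict(lambda: {"profit": 0.0, "disc": 0.0, "cnt": 0})
        for row in rows:
            profit = row["ExtendedPrice"] * (1 - row["Discount"])  # assumed formula
            cell = view_a[(row["PartKey"], row["SuppKey"])]
            cell["profit"] += profit
            cell["disc"] += row["Discount"]
            cell["cnt"] += 1
        return {k: (v["profit"], v["disc"] / v["cnt"]) for k, v in view_a.items()}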


Figure 1. The environment of Extraction-Transformation-Loading processes

(a) Control flow of an ETL scenario (b) Simple extraction part of an ETL scenario

(c) Computation of extra values and multiple aggregations as part of an ETL scenario

The EXPRESS system (Shu, Housel, Taylor, Ghosh, & Lum, 1977) is the first attempt that we know of with the purpose of producing data transformations, taking as input data definitions or the involved nonprocedural statements. During the later years, the emphasis on the data integration problem was significant, and wrapper-based exchange of data between integrated database systems was the closest thing to ETL that we can report – for example, see Roth and Schwarz (1997).

ETL has taken its name and existence as a separate set of tools and processes in the early '00s. Despite the fact that data warehouses had become an established practice in large organizations since the latest part of the '90s, it was only in the early '00s that both the industrial vendors and the research community cared to deal seriously with the field. It is noteworthy that, till then, the research community had typically hidden the internals of ETL processes "under the carpet" by treating the data warehouse as a set of materialized views over the sources. At the same time, the industrial vendors were focused on providing fast querying and reporting facilities to end users. Still, once data warehouses were established as a practice, it was time to focus on the tasks faced by the developers. As a result, during the '00s, the industrial field is flourishing with tools from both the major database vendors and specialized companies and, at the same time, the research community has abandoned the treatment of data warehouses as collections of materialized views and focuses on the actual issues of ETL processes.

The intention of this survey is to present the research work in the field of ETL in a structured way. The reader is assumed to be familiar with the fundamental concepts of databases and data warehousing.


Since the focus is on ETL processes, we will avoid the detailed coverage of research topics like materialized view refreshment and data cleaning that are close to ETL processes (in fact, in practical situations, these tasks can be important constituents of an ETL workflow) but have an existence of their own in the research literature. Therefore, the coverage of these topics mainly includes a discussion of the problems under investigation and refers the reader to dedicated surveys with a broad discussion of the state of the art.

The discussion will start with the examination of design issues and then it will proceed to cover technical issues for ETL processes. In Section 2, we cover the conceptual and logical modeling of ETL processes, along with some design methods. In Section 3, we visit each stage of the E-T-L triplet, and examine problems that fall within each of these stages. Then, in Section 4, we discuss problems that pertain to the entirety of an ETL process (and not just to one of its components), such as issues around the optimization, resumption and benchmarking of ETL processes, along with a discussion of the newest trend in ETL technology, near-real time ETL. In Section 5, we review some research prototypes with academic origin. Finally, in Section 6, we conclude our coverage of the topic with an eye for the future.

STATE OF THE ART FOR THE DESIGN AND MODELING OF ETL PROCESSES

Traditionally, a large part of the literature, the research activity, and the research community of data warehouses is related to the area of conceptual modeling and design. To a large extent, this is due to the fact that data warehouse projects are highly costly and highly risky endeavors; therefore, careful design and preparation are necessary. Moreover, due to their complexity, data warehouse environments should be carefully documented for maintenance purposes. ETL processes could not escape the above rule and, therefore, they have attracted the attention of the research community. A typical reason for this attention is also the fact that mainstream industrial approaches –see for example Kimball, Reeves, Ross & Thornthwaite (1998), or Kimball & Caserta (2004)– focus on the physical-level details and lack a principled approach towards the problem of designing a data warehouse refreshment process. In this section, we will discuss the appearance of research efforts for the conceptual modeling of ETL processes with a chronological perspective and also cover some efforts concerning the logical modeling and design methods for this task.

UML Meta Modeling for Data Warehouses

The first approaches that are related to the conceptual design of data warehouses were hidden in discussions pertaining to data warehouse metadata. Stöhr, Müller, & Rahm (1999) propose a UML-based metamodel for data warehouses that covers both the back stage and the front-end of the data warehouse. Concerning the back stage of the data warehouse, which is the part of the paper that falls in the scope of this survey, the authors cover the workflow from the sources towards the target data stores with entities like Mapping (among entities) and Transformation (further classified to aggregations, filters, etc.) that are used to trace the inter-concept relationships in this workflow environment. The overall approach is a coherent, UML-based framework for data warehouse metadata, defined at a high level of abstraction. The main contribution of the authors is that they provide a framework where specialized ETL activities (e.g., aggregations, cleanings, pivots, etc.) can be plugged in easily via some kind of specialization.

First Attempts towards a Conceptual Model

The first attempt towards a conceptual model dedicated to the design and documentation of the data warehouse refreshment was presented by Vassiliadis, Simitsis & Skiadopoulos (DOLAP 2002). The main motivation for the model was the observation that, during the earliest stages of the data warehouse design, the designer is concerned with the analysis of the structure and content of the existing data sources and their mapping to the common data warehouse model. Therefore, a formal model for this task was necessary at the time.

The model of Vassiliadis et al (DOLAP 2002) involves concepts (standing for source and warehouse data holders) and their attributes, which define their internal structure. Part-of relationships correlate concepts and their constituent attributes. Attributes of source and warehouse concepts are related to each other with provider relationships. If a transformation takes place during the population of a target attribute, then the provider relationship passes through a transformation node in the model. Multiple transformations are connected via serial composition relationships. The model allows the definition of ETL constraints that signify the need for certain checks at the data level (e.g., a certain field must be not null or within a certain range of values, etc.). Finally, multiple source candidates for the population of a warehouse concept are also tracked via candidate relationships. This is particularly useful for the early stages of the design, where more than one source can be chosen for the population of a warehouse fact table. If one of them is eventually chosen (e.g., due to its higher quality), then this source concept is characterized as the active candidate for the model.

A second observation of Vassiliadis et al (DOLAP 2002) was that it is practically impossible to forecast all the transformations and cleanings that a designer might ever need. So, instead of proposing a closed set of transformations, the authors discuss the extensibility of the model with template transformations that are defined by the designer.

UML Revisited for ETL Processes

Trujillo & Luján-Mora (2003) revisit the conceptual modeling of ETL workflows from the viewpoint of UML with the basic argument that the previous modeling by Vassiliadis et al is done via an ad-hoc model. So, the authors try to facilitate the modeling effort for ETL workflows with standard methods and they employ UML for this purpose.

It is interesting that the authors employ class diagrams and not activity diagrams for their modeling. The participating entities are UML packages; this is a powerful feature of the model, since it allows the arbitrary nesting of tasks. This nesting mechanism, quite common in UML, alleviates the complexity of the model of Vassiliadis et al., since it allows a gradual zooming in and out of tasks at different levels of detail. The main reason for dealing with class diagrams is that the focus of the modeling is on the interconnection of activities and data stores and not on the actual sequence of steps that each activity performs. Under this prism, whenever an activity A1 populates an activity A2, then A2 is connected to A1 with a dependency association.

Then, Trujillo & Luján-Mora (2003) provide a short description for a set of commonly encountered activities. Each such template activity is graphically depicted as an icon. The activities covered by the authors are: aggregation, conversion, logging, filtering, join, and loading of data, as well as checks for incorrect data, merging of data coming from different sources, wrapping of various kinds of external data sources and surrogate key assignment.

The authors build their approach on a generic structure for the design process for ETL workflows. So, they structure their generic design process in six stages, specifically, (i) source selection, (ii) source transformation, (iii) source data join, (iv) target selection, (v) attribute mappings between source and target data and (vi) data loading.
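As a rough illustration of the kind of constructs these conceptual models manipulate, the following Python sketch renders concepts, attributes, provider relationships, transformations and candidate sources as plain data classes. The class and field names are ours and are not taken from either of the cited papers.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Concept:                        # a source or warehouse data holder
        name: str
        attributes: List[str] = field(default_factory=list)   # part-of relationships

    @dataclass
    class Transformation:
        name: str                         # e.g. "NotNullCheck", "ComputeProfit"

    @dataclass
    class ProviderRelationship:           # provider attribute -> consumer attribute
        source: str                       # "Concept.attribute"
        target: str
        via: Optional[Transformation] = None   # present if a transformation intervenes

    @dataclass
    class CandidateSet:                   # alternative sources for one warehouse concept
        target_concept: str
        candidates: List[str]
        active: Optional[str] = None      # the candidate that is eventually chosen

    lineitem_src = Concept("S1.LINEITEM", ["PartKey", "ExtendedPrice"])
    dw_fact = Concept("DW.LINEITEM", ["PartKey", "Profit"])
    mapping = ProviderRelationship("S1.LINEITEM.ExtendedPrice", "DW.LINEITEM.Profit",
                                   via=Transformation("ComputeProfit"))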


UML and Attribute Mappings

One of the main assumptions of the approach of Trujillo & Luján-Mora (2003) was that the user must not be overwhelmed with the multitude of attribute mappings between sources, ETL activities and target warehouse tables. Still, as already mentioned, such detail is important for the back-stage of the data warehouse. Without capturing the details of the inter-attribute mappings, important transformations, checks and contingency actions are not present in the documentation of the process and can be ignored at the construction/generation of code. Despite the effort needed, this documentation can be useful at the early stages of the project, where the designer is familiarized with the internal structure and contents of the sources (which include data quality problems, cryptic codes, conventions made by the administrators and the programmers of the sources and so on). Moreover, this documentation can also be useful during later stages of the project, where the data schemata as well as the ETL tasks evolve and sensitive parts of the ETL workflow, both at the data and the activities, need to be highlighted and protected.

It is interesting that no standard formalism like the ER model or UML treats attributes as first-class citizens – and, as such, they are unable to participate in relationships. Therefore, Luján-Mora, Vassiliadis & Trujillo (2004) stress the need to devise a mechanism for capturing the relationships of attributes in a way that (a) is as standard as possible and (b) allows different levels of zooming, in order to avoid overloading the designer with the large amount of attribute relationships that are present in a data warehouse setting.

To this end, the authors devise a mechanism for capturing these relationships via a UML data mapping diagram. UML is employed as a standard notation and its extensibility mechanism is exploited, in order to provide a standard model to the designers. Data mapping diagrams treat relations as classes (like the UML relational profile does). Attributes are represented via proxy classes, connected to the relation classes via stereotyped "Contain" relationships. Attributes can be related to each other via stereotyped "Map" relationships.

A particular point of emphasis made by Luján-Mora et al. (2004) is the requirement for multiple, complementary diagrams at different levels of detail. The authors propose four different layers of data mappings, specifically, (a) the database level, where the involved databases are represented as UML packages, (b) the dataflow level, where the relationships among source and target relations are captured, each in a single UML package, (c) the table level, where the dataflow diagram is zoomed in and each individual transformation is captured as a package, and (d) the attribute level, which offers a zoom-in to a table-level data mapping diagram, with all the attributes and the individual attribute-level mappings captured.

State-of-the-Art at the Logical Level

Apart from the conceptual modeling process that constructs a first design of the ETL process, once the process has been implemented, there is a need to organize and document the meta-information around it. The organization of the metadata for the ETL process constitutes its logical-level description – much like the system's catalog acts as the logical-level description of a relational database.

Davidson & Kosky (1999) present WOL, a Horn-clause language, to specify transformations between complex types. The transformations are specified as rules in a Horn-clause language. An interesting idea behind this approach is that a transformation of an element can be decomposed to a set of rules for its elements, thus avoiding the difficulty of employing complex definitions.

As already mentioned, the first attempts towards a systematic description of the metadata of the ETL process go back to the works by Stöhr et al. (1999) and Vassiliadis, Quix, Vassiliou & Jarke (2001). This research has been complemented by the approach of Vassiliadis, Simitsis & Skiadopoulos (DMDW 2002), where a formal logical model for ETL processes is proposed.


The main idea around this model concerns the capturing of the data flow from the sources to the warehouse. This means that meta-information concerning relations or files and their schemata is kept, along with the meta-information on the activities involved and their semantics. Naturally, the interconnection of all these components in a workflow graph that implements the ETL process is also captured. The proposal of Vassiliadis et al (DMDW 2002) models the ETL workflow as a graph. The nodes of the graph are activities, recordsets (uniformly modeling files and relations) and attributes; note that recordsets have a schema in the typical way and activities have input schemata, as well as an output and a rejection schema (the latter serving the routing of offending rows to quarantine). Part-of edges connect the attributes with their encompassing nodes and provider relationships show how the values are propagated among schemata (in other words, provider relationships trace the dependencies, in terms of data provision, between one or more data provider attributes and a populated, data consumer attribute). Derived relationships also capture the transitive dependency of attributes that are populated with values that are computed via functions and parameters. An ETL workflow is a serializable combination of ETL activities, provider relationships and data stores. The overall modeling of the environment is called Architecture Graph; in other words, an architecture graph is the logical-level description of the data flow of an ETL process.

The above-mentioned model suffered from the lack of concrete modelling of the semantics of the activities as part of the graph. In other words, the provider relationships capture only the dependencies of a consumer attribute to its data providers, without incorporating the actual filterings and transformations that take place on the way from the provider to the consumer. This shortcoming was complemented by the work of Vassiliadis, Simitsis, Georgantas, Terrovitis, & Skiadopoulos (CAiSE 2003, IS 2005, DaWaK 2005, ER 2005). Due to the complicated nature of the internal semantics of the activities, the fundamental idea of these papers is to describe the meta-information via a series of intermediate schemata. Coarsely speaking, the semantics of the activity are related to this graph of internal, intermediate schemata via the simple convention that a schema corresponds to a predicate in a rule-based language. LDL, a Datalog variant, is the chosen language for this line of papers. Then, each rule of the form

OUTPUT <- INPUT, filters, functions, input-to-output mappings

practically stands for a combination of inputs, outputs, comparison nodes and functions that connect the respective schemata via provider edges (see the long version of Vassiliadis et al, ER 2005 for a detailed description).

The works by Vassiliadis et al. (CAiSE 2003, IS 2005) also present a template language that allows ETL designers to define reusable templates of LDL programs via the appropriate macros (see also the discussion in the section "Systems" for the ARKTOS tool).

From the very beginning, this line of work was concerned with the exploitation of the meta-information for ETL processes. The papers by Vassiliadis et al. (DMDW 2002; ER 2005) are concerned with metrics for the identification of important properties of an ETL design. The metrics proposed by Vassiliadis et al. (DMDW 2002) are simple but quite powerful and capture the degree of dependence of a node to other nodes and vice versa. The metrics proposed by Vassiliadis et al. (ER 2005) are based on a rigorous framework for metrics of graph-based software constructs and show the size, cohesion, coupling, and complexity of a constructed ETL process.
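The following Python fragment is a toy rendering of an Architecture Graph for a single "not null" check between a source and a warehouse table: recordsets, an activity and attributes become nodes, while part-of and provider relationships become labeled edges. The class is ours and only hints at the structure; the model in the cited papers is considerably richer.

    class ArchitectureGraph:
        def __init__(self):
            self.nodes = {}          # name -> kind ("recordset" | "activity" | "attribute")
            self.edges = []          # (kind, from_node, to_node)

        def add_node(self, name, kind):
            self.nodes[name] = kind

        def part_of(self, attribute, owner):
            self.edges.append(("part-of", attribute, owner))

        def provider(self, provider_node, consumer_node):
            self.edges.append(("provider", provider_node, consumer_node))

    g = ArchitectureGraph()
    g.add_node("S1.LINEITEM", "recordset")
    g.add_node("NotNull(Discount)", "activity")
    g.add_node("DW.LINEITEM", "recordset")
    for rs in ("S1.LINEITEM", "DW.LINEITEM"):
        g.add_node(rs + ".Discount", "attribute")
        g.part_of(rs + ".Discount", rs)
    # values of the warehouse attribute are provided through the filter's schemata
    g.provider("S1.LINEITEM.Discount", "NotNull(Discount)")
    g.provider("NotNull(Discount)", "DW.LINEITEM.Discount")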


Semantics-aware Design Methods for ETL

A semi-automatic transition from the conceptual to the logical model for ETL processes has been proposed first by Simitsis (2005) and later by Simitsis & Vassiliadis (DSS 2008). Simple mappings involve the mapping of concepts to relations, and the mapping of transformations and constraint checks to the respective activities. The hardest problem that is dealt with is the mapping of the declarative requirements for transformations and checks at the conceptual level to a sequence of activities at the logical level. The paper proposes an algorithm that groups transformations and checks in stages, with each stage being a set of activities whose execution order can be transposed. Stage derivation is based on the idea of dependency: if the input attributes of a certain activity a depend on the (output) attributes of another activity or recordset b, then a should be on a higher-order stage than b. Binary activities and persistent recordsets typically act as boundaries for stages. Since the presented algorithms start with simple cases of single source – single target pairs, binary activities are used as pivotal points of the scenario for the derivation of common sub-paths (after the output of the binary activity). Overall, since the ordering of stages is quite simple, the determination of the execution orders for all the activities is greatly simplified.

Skoutas & Simitsis (DOLAP 2006, IJSWIS 2007) use ontologies to construct the conceptual model of an ETL process. Based on domain knowledge of the designer and user requirements about the data sources and the data warehouse, an appropriate OWL ontology is constructed and used to annotate the schemas of these data stores. Then, according to the ontology and these annotations, an OWL reasoner is employed to infer correspondences and conflicts between the sources and the target, and to propose conceptual ETL operations for transferring data between them. Skoutas and Simitsis (NLDB 2007) use ontologies to document the requirements of an ETL process. The proposed method takes a formal, OWL description of the semantic descriptions of (a) the data sources, (b) the data warehouse, as well as (c) the conceptual specification of an ETL process, and translates them to a textual format that resembles natural language. This is facilitated via a template-based approach for the constructs of the formal description.

INDIVIDUAL OPERATORS AND TASKS IN ETL SCENARIOS

In this section, we organize the review of the literature based on the constituents of the ETL process, specifically, the extraction, transformation (& cleaning), and loading phases. For each phase, we discuss practical problems and solutions proposed by the research community.

Research Efforts Concerning Data Extraction Tasks

The extraction is the hardest part of the refreshment of the data warehouse. This is due to two facts. First, the extraction software must incur minimum overheads to the source system, both at runtime and at the nightly time window that is dedicated to the refreshment of the warehouse. Second, the extraction software must be installed at the source side with minimum effect to the source's software configuration. Typical techniques for the task include taking the difference of consecutive snapshots, the usage of timestamps (acting as transaction time) in source relations, or the "replaying" of the source's log file at the warehouse side. Non-traditional, rarely used techniques require the modification of the source applications to inform the warehouse on the performed alterations at the source, or, the usage of triggers at the source side.

Snapshot difference between a newer and an older snapshot of a relation seems straightforward: in principle, the identification of records that are present in one snapshot and not in the other gives us the insertions and deletions performed; updates refer to two tuples that are present in the two snapshots and share the same primary key, but different non-key values. Despite this theoretical simplicity, performance considerations are very important and pose a research problem.

The research community has dealt with the problem from the mid '80s. Lindsay, Haas, Mohan, Pirahesh, & Wilms (1986) propose a timestamp-based algorithm for detecting insertions, deletions and updates in two snapshots.

A simple algorithm annotating changes with timestamps and comparing the values for tuples changed after the last comparison is proposed. Improvements concerning the identification of "empty areas" in the table speed up the process further.

Labio & Garcia-Molina (1996) have presented the state-of-the-art method on the topic of snapshot difference. The paper assumes two snapshots as input and produces a sequence of insertion, deletion and update actions to tuples, with each tuple being identified by its key. Several algorithms are discussed in the paper, including sort-merge outerjoins and partitioned hash joins. These algorithms can be extended with compression techniques in order to reduce the size of the processed data and the incurred I/Os, as well as in order to exploit the opportunity of fitting one of the two snapshots in main memory. Compression is applied to each tuple individually (or to a whole block) and, interestingly, the employed compression function is orthogonal to the algorithm of choice, with the extra observation that lossy compression methods might introduce erroneously identified modifications with very small probability. Overall, the investigated compression techniques include two methods, (a) simple tuple compression and (b) simple tuple compression with a pointer to the original uncompressed record. Then, the authors propose several algorithms for the identification of modifications. The first algorithm performs a sort-merge outer-join. Due to the anticipated periodic execution of the algorithm, it is safe to assume that the previous snapshot is sorted. The algorithm sorts the new snapshot into runs and proceeds as the typical sort-merge join, with the extra fundamental checks that are mentioned above, concerning the characterization of a tuple as an insertion (deletion), update, or existing entry. A variant of the algorithm involves the usage of compressed snapshots. In this case, the pointer to the original tuple can help with the identification of updates. Another variant of the algorithm concerning the hash join is also discussed. The main proposal of the paper, though, is the so-called window algorithm, which exploits the idea that the tuples that are present in both snapshots are found in approximately the same place in the two snapshots. The algorithm uses an input buffer and a buffer for candidate modified tuples (ageing buffer in the paper's terminology) per snapshot. The algorithm fills the input buffers and compares their contents. Tuples found identical in the two buffers are no longer considered. Misses are tested over the ageing buffer of the other snapshot. If a match occurs, the tuples are further ignored. The tuples that remain in the input buffers after these tests are candidates to be insertions or deletions and they are pushed to the ageing buffer of their snapshot. Due to space requirements, if an ageing buffer is full, it must be emptied. To this end, a queue of pointers is maintained, keeping track of the order in which the tuples entered the buffer; if the buffer is full, the oldest tuples are emptied. The algorithm is quite efficient and safe if the same records are physically placed in nearby areas in the two snapshots, and the experimental results have proved that this is a realistic assumption.
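A minimal sketch of the simplest flavor of snapshot difference follows: both snapshots are assumed to fit in memory and to be identified by a primary key, and the comparison is a plain sort-merge pass. The buffering, compression and windowing refinements discussed above are deliberately left out.

    def snapshot_diff(old_snapshot, new_snapshot, key="id"):
        old_sorted = sorted(old_snapshot, key=lambda r: r[key])
        new_sorted = sorted(new_snapshot, key=lambda r: r[key])
        inserts, deletes, updates = [], [], []
        i = j = 0
        while i < len(old_sorted) and j < len(new_sorted):
            o, n = old_sorted[i], new_sorted[j]
            if o[key] == n[key]:
                if o != n:                       # same key, different non-key values
                    updates.append(n)
                i += 1; j += 1
            elif o[key] < n[key]:
                deletes.append(o); i += 1        # present only in the old snapshot
            else:
                inserts.append(n); j += 1        # present only in the new snapshot
        deletes.extend(old_sorted[i:])
        inserts.extend(new_sorted[j:])
        return inserts, deletes, updates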


Research Efforts Concerning Data Transformation Tasks

Although naïve data transformations are inherently built inside SQL and relational algebra, transformations that are used in ETL scenarios have not really found their way into the research literature. The main reason for this is probably the fact that transformations of this kind are typically ad-hoc and rather straightforward if studied individually; the complexity arises when their combination in a workflow is introduced. Nevertheless, there are some approaches that try to deal with the way to transform input to output data efficiently, and elegantly in terms of semantics.

The Pivoting Problem

Pivoting refers to a common spreadsheet operation for the tabular presentation of data to the end user, which is also quite common in ETL processes. Assume that a user wants to represent information about employees' revenues, with the revenues of each employee split into categories like salary, tax, work bonus, and family bonus. There are two possible organizations for this kind of data. First, a row-oriented organization is based on a table with the above categories represented as attributes – e.g., consider a relation of the form EMPr(EID, ESAL, ETAX, EWBonus, EFBonus). A second, attribute-value representation splits the revenues of each employee into different rows – e.g., consider a relation of the form EMPav(EID, RID, Amount), with RID being a foreign key to a table REVENUE_CATEGORIES(RID, RDescr) that contains values {(10, Sal), (20, Tax), …}. Pivoting is the operation that transforms relation EMPav to EMPr; unpivot is the inverse operation. Cunningham, Galindo-Legaria & Graefe (2004) discuss efficient ways to implement these two operations in a DBMS environment. The authors start by suggesting compact extensions of SQL to allow the end user to directly express the fact that this is the requested operation (as opposed to asking the user to perform pivot/unpivot through complicated expressions that the optimizer will fail to recognize as one operation). Special treatment is taken for data collisions and NULL values. The authors build upon the fact that pivot is a special case of aggregation, whereas unpivot is a special case of a correlated nested loop self join, and explore query processing and execution strategies, as well as query optimization transformations, including the possibilities of pushing projections and selections down in execution plans that directly involve them.

Data Mappers

A sequence of papers by Carreira et al. (DaWaK 2005, DKE 2007, ICEIS 2007) explore the possibilities imposed by data transformations that require one-to-many mappings, i.e., transformations that produce several output tuples for each input tuple. This kind of operation is typically encountered in ETL scenarios. Since relational algebra is not equipped with an operator that performs this kind of input-output mapping, Carreira et al. extend it by proposing a new operator called data mapper and explore its semantics and properties. In this discussion, we mainly focus on the work of Carreira et al (DKE, 2007), which is a long version of a previous work (DaWaK 2005). The authors define the data mapper operator as a computable function mapping the space of values of an input schema to the space of values of an output schema. Mappers are characterized as single or multi-value mappers, depending on whether one or more tuples occur at the output given an arbitrary tuple at the input of the operator. Mappers under investigation are minimalistic operators and should be highly cohesive (i.e., they should do exactly one job); to this end, a mapper is defined to be in normal form if it cannot be expressed as the composition of two other mappers. Carreira et al. propose algebraic rewritings to speed up the execution of composite expressions of the extended relational algebra. Specifically, a data mapper can be combined with an (originally subsequent) selection condition, if the condition operates on a parameter of a mapper function. A second rule directs how a selection condition that uses attributes of the output of a data mapper can be translated to the attributes that generate them and thus be pushed through the mapper. A third rule deals with how projection can help avoid unnecessary computations of mappers that will be subsequently projected out later. Finally, Carreira et al (ICEIS 2007) report some first results on their experimentation with implementing the data mappers in a real RDBMS. A more detailed description of alternative implementations is given by Carreira et al (QDB 2007), concerning unions, recursive queries, table functions, stored procedures and pivoting operations as candidates for the implementation of the data mapper operator. The first four alternatives were used for experimentation in two different DBMSs, and table functions appear to provide the highest throughput for data mappers.
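Since unpivoting EMPr into EMPav is itself a one-to-many transformation, it doubles as a small example of a multi-value mapper. The sketch below is illustrative only: the RID codes for the two bonus categories, which the text does not list, are assumed.

    RID_OF = {"ESAL": 10, "ETAX": 20, "EWBonus": 30, "EFBonus": 40}   # 30 and 40 are assumed

    def unpivot_mapper(empr_tuple):
        # map one EMPr(EID, ESAL, ETAX, EWBonus, EFBonus) row to several EMPav rows
        eid = empr_tuple["EID"]
        for attr, rid in RID_OF.items():
            amount = empr_tuple.get(attr)
            if amount is not None:
                yield {"EID": eid, "RID": rid, "Amount": amount}

    rows = list(unpivot_mapper(
        {"EID": 7, "ESAL": 1000, "ETAX": 300, "EWBonus": 50, "EFBonus": 20}))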


The Data Lineage Problem

Apart from efficiently performing the transformations in an ETL process, it is also important to be able to deal with the inverse problem. Given a certain warehouse tuple (or set of tuples), the data lineage problem involves the identification of the input data that are the originators of the given tuple (or tuples).

Cui and Widom have investigated the problem of data lineage in a set of papers (Cui & Widom, 2001 & 2003). Here, we mostly focus on Cui & Widom (2003), which describes the authors' method in more detail. This approach treats each transformation as a procedure that can be applied to one or more datasets and produces one or more datasets as output. Cui and Widom present a set of properties that a transformation can have, specifically (a) stability, if it never produces datasets as output without taking any datasets as input, (b) determinism and (c) completeness, if each input data item always contributes to some output data item. The authors assume that all transformations employed in their work are stable and deterministic and proceed to define three major transformation classes of interest: dispatchers, aggregators and black-boxes.

• A transformation is a dispatcher, if each input data item produces zero or more output items (with each output item possibly being derived by more than one input data item). A special category of dispatchers are filters. A dispatcher is a filter if each input item produces either itself or nothing.
• A transformation is an aggregator, if it is complete and there exists a unique disjoint partition of the input data set that contributes to some output data item. An aggregator is context-free if the output group to which an input data item is mapped can be found by observing the value of this data item alone, independently of the rest of the data items of the input. An aggregator is key preserving if all the input originators of an output tuple have the same key value.
• A transformation is a black-box, if it is neither a dispatcher nor an aggregator.

Several other sub-categories concerning the mapping of input to output values on the basis of the keys of the input and the output are also defined. The main idea is that if a certain value of an output (input) tuple can directly lead to its respective input (output) tuple(s), then this can be exploited during lineage determination. For example, backward key transformations have the property that, given an output tuple, one can determine the key of the input tuples that produced it; in this case, lineage is straightforward. In the general case, the lineage determination for dispatchers requires one pass of the input; the lineage determination of aggregators requires several full scans of the input, whereas black boxes have the entire input as their lineage.

Given a sequence of transformations in an ETL scenario, it is necessary to keep a set of intermediate results between transformations, in order to be able to determine the data lineage of the output. Appropriate indexes may be used to relate each output to its input. To avoid storing all the intermediate results, the authors propose a normalization process that allows the reordering of the transformations in such a way that adjacent transformations share similar properties. Assuming two adjacent transformations a and b, the idea is that their grouping is beneficial if one can determine the lineage of a tuple in the output of b at the input of a, without storing any intermediate results. To this end, Cui and Widom propose a greedy algorithm, called Normalize, which repeatedly discovers beneficial combinations of adjacent transformations and combines the best pair of transformations.
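The two "easy" cases of lineage determination mentioned above can be stated in a few lines; the sketch below does so for a filter and for a context-free aggregator, with a group_of function standing in for the property that an input tuple's output group is determined by the tuple alone. Everything here is illustrative rather than the cited paper's actual algorithms.

    def filter_lineage(output_tuple, input_dataset):
        # for a filter, every surviving tuple is its own originator
        return [t for t in input_dataset if t == output_tuple]

    def aggregator_lineage(output_group_key, input_dataset, group_of):
        # one pass that collects the input tuples falling in the output tuple's group
        return [t for t in input_dataset if group_of(t) == output_group_key]

    src = [{"PartKey": 1, "Profit": 5.0}, {"PartKey": 1, "Profit": 2.0},
           {"PartKey": 2, "Profit": 9.0}]
    originators = aggregator_lineage(1, src, group_of=lambda t: t["PartKey"])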


Theoretical Foundations for ETL Processes

The theoretical underpinnings concerning the internal complexity of simple ETL transformations are investigated by research that concerns the data exchange problem. Assume a source and a target schema, a set of mappings that specify the relationship between the source and the target schema, and a set of constraints at the target schema. The problem of data exchange studies whether we can materialize an instance at the target, given an instance at the source, such that all mappings and constraints are respected. Data exchange is a specific variant of the general data integration meta-problem that requires the materialization of the result at the target, with the ultimate purpose of being able to subsequently pose queries to it, without the luxury of referring back to the source. Due to this inherent characteristic of the problem, we believe that data exchange is the closest theoretical problem to ETL that we know of.

As Fagin, Kolaitis & Popa (2005) mention, several problems arise around the data exchange problem, with particular emphasis on (a) materializing the solution that reflects the source data as accurately as possible and (b) doing so in an efficient manner. A first important result comes in the identification of the best possible solutions to the problem, called universal solutions (Fagin, Kolaitis, Miller, & Popa, 2005). Universal solutions have the good property of having exactly the data needed for the data exchange and can be computed in polynomial time via the chase. It is interesting that if a solution exists, then a universal solution exists too, and every other solution has a homomorphism to it. At the same time, Fagin et al (TODS 2005) deal with the problem of more than one universal solution for a problem and introduce the notion of a core, which is the universal solution of the smallest size. For practical cases of data exchange, polynomial-time algorithms can be used for the identification of the core. It is worth noting that the schema mappings explored are investigated for the case of tuple-generating dependencies, which assure that, for each source tuple x, whenever a conjunctive formula applies to it, then there exists a target tuple y such that a conjunctive formula over x and y applies too (in other words, we can assure that mappings and constraints are respected). To our point of view, this is a starting point for the investigation of more complex mappings that arise in ETL settings. Further results have to do with the ability to pose queries and obtain certain answers from a data exchange setting, as well as with the management of inverse mapping relationships (Fagin, 2007).

Data Cleaning

The area of data cleaning, although inherently related to the ETL process, practically constitutes a different field on its own. Covering this field adequately is well beyond the scope of this paper. The topic that has been in the center of attention of the research community in the area of data cleaning concerns record matching, with a particular emphasis on textual attributes. Record matching refers to the problem of identifying records that represent the same object / fact in the real world, with different values. Essentially, it is the case of textual fields that presents the major research challenge, since their unstructured nature allows users to perform data entry in arbitrary ways. Moreover, value discrepancies due to problems in the data entry or data processing, as well as different snapshots of the representation of the real-world fact in the database, contribute to making the problem harder. Typically, the identification of duplicates requires an efficient algorithm for deciding which tuples are to be compared and a distance (or, similarity) metric based on the values of the compared tuples (and, possibly, some knowledge by an expert). For recent, excellent surveys of the field, the reader is encouraged to first refer to a survey by Elmagarmid, Ipeirotis & Verykios (2007), as well as to a couple of tutorials by Koudas & Srivastava (2005) and Koudas, Sarawagi & Srivastava (2006) in the recent past.
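A bare-bones illustration of the two ingredients named above – a rule for deciding which tuples are compared and a similarity metric over their values – is sketched below, using a first-letter blocking key and a token-level Jaccard similarity. Both the blocking key and the threshold are assumptions of this sketch, not recommendations from the cited surveys.

    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def match_records(records, block_key=lambda r: r["name"][:1].lower(), threshold=0.6):
        blocks = {}
        for r in records:                        # blocking: only compare within a block
            blocks.setdefault(block_key(r), []).append(r)
        matches = []
        for block in blocks.values():
            for i in range(len(block)):
                for j in range(i + 1, len(block)):
                    if jaccard(block[i]["name"], block[j]["name"]) >= threshold:
                        matches.append((block[i], block[j]))
        return matches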

Research Efforts Concerning Data Loading Tasks

Typically, warehouse loading tasks take place in a periodic fashion, within a particular time window during which the system is dedicated to this purpose. Bulk loading is performed (a) during the very first construction of the warehouse and (b) in an incremental way, during its everyday maintenance. During the latter task, a set of new insertions, deletions and updates arrive at the warehouse, after being identified at the extraction phase and, subsequently, transformed and cleaned. This set of data is typically called the 'delta' relation and has to be loaded to the appropriate destination table. In all occasions, loading is performed via vendor-specific bulk loaders that provide maximum performance (see Vassiliadis & Simitsis, 2009 for a discussion on the topic). Also, apart from the dimension tables and the fact tables, indexes and materialized views must also be maintained. It is interesting that only the loading of views and indexes has been investigated by the research community; in this subsection, we give an overview of important references in the bibliography.

Maintenance of Indexes

Concerning the general setting of index maintenance, Fenk, Kawakami, Markl, Bayer & Osaki (2000) highlight the critical parameters of index bulk loading, which include the minimization of random disk accesses, disk I/O and CPU load, and the optimization of data clustering and page filling. As typically happens with spatial indexes, Fenk et al. identify 3 stages in the bulk loading of a multidimensional warehouse index: (a) key calculation for each tuple of the data set, (b) sorting of the tuples on the basis of their keys and (c) loading of the sorted data into the index.

Roussopoulos, Kotidis & Roussopoulos (1997) discuss the efficient construction and bulk load of cube trees. Cube trees and their variants are based on the fundamental idea of finding an efficient data structure for all the possible aggregations of these data. Assume a relation with M attributes, out of which N are used for grouping; the cube tree organizes all the possible aggregations over these N attributes in a single structure, allowing both efficient bulk loading and efficient query answering of the cube tree. Roussopoulos et al (1997) discuss the bulk loading and maintenance of cube trees that are implemented over packed R-trees. The authors make the sharp observation that individual updates are practically impossible in terms of performance and, thus, bulk updates are necessary. The approach is based on sorting the delta increment that is to be loaded and merging it with the existing cube tree. Since all possible aggregates are kept by cube trees, for every delta tuple, all its possible projections are computed and stored in the appropriate buffer. When a buffer is full, it is sorted and then it is staged for the subsequent merge that takes place once all the delta increment has been processed. The authors also highlight that, as time passes, the points of all the possible cubes of the multidimensional space are covered with some values, which makes the incremental update even easier. Moreover, since the merging involves three parts, (i) the old cube tree, (ii) the delta and (iii) the new cube tree, their merging can be efficiently obtained by using three disks, one for each part. Roussopoulos, Kotidis & Sismanis (1999) discuss this possibility.

Fenk et al. (2000) propose two bulk loading algorithms for the UB-Tree, one for the initial and one for the incremental bulk loading of a UB-tree. The UB-Tree is a multidimensional index, which is used along with specialized query algorithms for the sequential access to the stored data. The UB-Tree clusters data according to a space-filling Z-curve. Each point of the multidimensional space is characterized by its Z-address and Z-addresses are organized in disjoint Z-regions mapped to the appropriate disk pages. This way, a tree can be formed and the location of specific points at the disk can be computed. Concerning the abovementioned generic method for bulk loading that Fenk et al have highlighted, in the case of UB-trees, the keys are the Z-addresses of the tuples, the data are sorted on these addresses and the sorted tuples are packed into pages; full pages are split in two and one of the two halves is stored as a UB-tree page. The incremental variant of the algorithm tries to identify the correct page for inserting a tuple under consideration.
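Step (a) of this recipe, the key calculation, amounts to computing a Z-address by interleaving the bits of the coordinates; a compact sketch follows, with a fixed 16-bit width per dimension chosen purely for illustration.

    def z_address(point, bits_per_dim=16):
        # interleave the bits of the coordinates to obtain the Morton (Z-order) key
        z = 0
        for bit in range(bits_per_dim):
            for dim, coordinate in enumerate(point):
                if coordinate & (1 << bit):
                    z |= 1 << (bit * len(point) + dim)
        return z

    # sorting the data set on its Z-addresses is step (b); the sorted run is then
    # packed into Z-regions / pages in step (c)
    data = [(3, 5), (1, 1), (7, 2)]
    sorted_for_loading = sorted(data, key=z_address)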


Dwarfs (Sismanis, Deligiannakis, Roussopoulos & Kotidis, 2002) are indexes built by exploiting an appropriate ordering of the dimensions and the exploitation of common prefixes and suffixes in the data. A dwarf resembles a trie, with one level for each dimension of a cube and appropriate pointers among internal levels. High-cardinality dimensions are placed highly in the Dwarf, to quickly reduce the branching factor. Dwarf construction exploits a single, appropriate sorting of data and proceeds in a top-down fashion. Identical sub-dwarfs (i.e., dwarfs generated from the same set of tuples of the fact table) are pinpointed and coalesced. Dwarfs constitute the state-of-the-art, both in querying and in updating sets of aggregate views and, when size and update time are considered, they significantly outperform other approaches.

Materialized View Maintenance

View maintenance is a vast area by itself; covering the topic to its full extent is far beyond the focus of this paper. The main idea around view maintenance is developed around the following setting: Assume a set of base tables upon which a materialized view is defined via a query. Changes (i.e., insertions, deletions and updates) occur at the base tables, resulting in the need to refresh the contents of the materialized view (a) correctly and (b) as efficiently as possible. The parameters of the problem vary and include:

a. The query class: The area started by dealing with Select-Project-Join views, but the need for aggregate views, views with nested queries in their definition, outerjoins and other more complicated components has significantly extended the field. The query class (along with several assumptions) can also determine whether the materialized view can be updated solely with the changes and its current state, without accessing the base tables.
b. The nature of the changes: apart from simple insertions and deletions, it is an issue whether updates are treated per se or as a pair (delete old value, insert new value). Also, sets of updates as opposed to individual updates can be considered.
c. The number of materialized views that are concurrently being updated: in the context of data warehousing, and ETL in particular, simply dealing with one view being updated is too simplistic. Typically, several materialized views (possibly related in some hierarchical fashion, where one view can be derived from the other) have to be simultaneously refreshed.
d. The way the update is physically implemented: Typically, there are three ways to refresh views, (a) on-update, i.e., the instant a change occurs at the sources, (b) on-demand, i.e., in a deferred way, only when someone poses a query to the view and (c) periodically, which is the typical case for data warehousing, so far.

There are several surveys that give pointers for further reading. The earliest of them was authored by Gupta & Mumick (1995). One can also refer to a note by Roussopoulos (1998) and a chapter by Kotidis (2002). Gupta & Mumick (2006) is a recent paper that discusses maintenance issues for a complicated class of views. The reader is also referred to papers by Mumick, Quass & Mumick (1997), Colby, Kawaguchi, Lieuwen, Mumick & Ross (1997), Labio, Yerneni & Garcia-Molina (1999), and Stanoi, Agrawal & El Abbadi (1999) for the update of groups of views in the presence of updates.
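For the simplest corner of this design space – a SUM-per-group view refreshed periodically from a delta of insertions and deletions, with updates treated as delete/insert pairs – the refreshment step looks as follows; the view definition and field names are, of course, only illustrative.

    from collections import defaultdict

    def refresh_sum_view(view, inserted, deleted, key="PartKey", measure="Profit"):
        # apply the delta to the materialized view without touching the base table
        view = defaultdict(float, view)
        for row in inserted:
            view[row[key]] += row[measure]
        for row in deleted:
            view[row[key]] -= row[measure]
        return dict(view)

    v = {1: 7.0, 2: 9.0}
    v = refresh_sum_view(v, inserted=[{"PartKey": 1, "Profit": 3.0}],
                            deleted=[{"PartKey": 2, "Profit": 4.0}])
    # v == {1: 10.0, 2: 5.0}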

HOLISTIC APPROACHES

In the previous section, we have dealt with operators facilitating tasks that are located in isolation in one of the three main areas of the ETL process. In this section, we take one step back and give a holistic view of the research problems that pertain to the entirety of the ETL process. First, we discuss the problems of optimization and resumption of entire ETL workflows. Second, we visit the practical problem of the lack of a reference benchmark for ETL processes. Finally, we cover the newest topic of research in the area, concerning the near real time refreshment of a data warehouse.

Optimization

The minimization of the execution time of an ETL process is of particular importance, since ETL processes have to complete their task within specific time windows. Moreover, in the unfortunate case where a failure occurs during the execution of an ETL process, there must be enough time left for the resumption of the workflow. Traditional optimization methods are not necessarily applicable to ETL scenarios. As mentioned by Tziovara, Vassiliadis & Simitsis (2007), "ETL workflows are NOT big queries: their structure is not a left-deep or bushy tree, black box functions are employed, there is a considerable amount of savepoints to aid faster resumption in cases of failures, and different servers and environments are possibly involved. Moreover, frequently, the objective is to meet specific time constraints with respect to both regular operation and recovery (rather than the best possible throughput)." For all these reasons, the optimization of the execution of an ETL process poses an important research problem with straightforward practical implications.

Simitsis, Vassiliadis & Sellis (ICDE 2005, TKDE 2005) handle the problem of ETL optimization as a state-space problem. Given an original ETL scenario provided by the warehouse designer, the goal of the paper is to find a scenario which is equivalent to the original and has the best execution time possible. Each state is a directed acyclic graph with relations and activities as the nodes and data provider relationships as edges. The authors propose a method that produces states that are equivalent to the original one (i.e., given the same input, they produce the same output) via transitions from one state to another. A transition involves a restructuring of the graph, mainly in one of the following ways:

• swapping of two consecutive activities, if this is feasible, with the goal of bringing highly selective activities towards the beginning of the process (in a manner very similar to the respective query optimization heuristic in relational DBMSs),
• factorization of common (or, in the paper's terminology, homologous) activities that appear in different paths of the workflow ending in a binary transformation, with the goal of applying a transformation only once, later in the workflow, possibly to fewer or sorted data,
• distribution of common activities (the inverse of factorization), by pushing a transformation that appears late in the workflow towards its start, with the hope that a highly selective activity found after a binary transformation is pushed early enough in the workflow.

Two other transitions, merge and split, are used for special cases. The important problem behind the proposed transitions is that a restructuring of the graph is not always possible. For example, it is important to block the swapping of activities where the operation of the second requires an attribute computed in the first. At the same time, the detection of homologous activities requires the identification of activities with common functionality over data with similar semantics. An ontological mapping of attributes to a common conceptual space facilitates this detection. The paper uses a very simple cost model to assess the execution cost of a state and presents three algorithms to detect the best possible state. Apart from the exhaustive algorithm that explores the full search space, heuristics are also employed to reduce the search space and speed up the process.
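As a rough illustration of the swap transition and of the legality check mentioned above, consider the following sketch (Python); the classes, attribute sets and selectivities are simplifying assumptions of this example and not the authors' implementation.

    # Sketch: each activity declares the attributes it reads and the ones it generates.
    # Two consecutive unary activities a -> b may be swapped only if neither consumes
    # an attribute computed by the other.
    class Activity:
        def __init__(self, name, reads, generates, selectivity=1.0):
            self.name, self.reads, self.generates = name, set(reads), set(generates)
            self.selectivity = selectivity   # fraction of tuples the activity lets through

    def can_swap(a, b):
        return not (b.reads & a.generates) and not (a.reads & b.generates)

    def swap_if_beneficial(chain, i):
        """Swap chain[i] and chain[i+1] if legal and it moves the more
        selective (filter-like) activity earlier in the chain."""
        a, b = chain[i], chain[i + 1]
        if can_swap(a, b) and b.selectivity < a.selectivity:
            chain[i], chain[i + 1] = b, a
            return True
        return False

    surrogate = Activity("AssignSKey", reads={"part_id"}, generates={"skey"})
    not_null  = Activity("NotNullPrice", reads={"price"}, generates=set(), selectivity=0.4)
    chain = [surrogate, not_null]
    if swap_if_beneficial(chain, 0):
        print([a.name for a in chain])   # ['NotNullPrice', 'AssignSKey']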


with a paper on the optimization of ETL workflows at the physical level by Tziovara et al (2007). In this paper, the authors propose a method that produces an optimal physical-level scenario given its logical representation as an input. Again, the input ETL workflow is considered to be a directed acyclic graph constructed as mentioned above. A logical-level activity corresponds to a template, i.e., an abstract operation that is customized with schema and parameter information for the ETL scenario under consideration. Each logical-level template is physically implemented by a variety of implementations (much like a relational join is implemented by nested-loops, merge-sort, or hash-join physical-level operators). Each physical operator is sensitive to the order of the incoming stream of tuples and has a different cost and needs of system resources (e.g., memory, disk space, etc.). Again, the problem is modeled as a state-space problem, with states representing full implementations of all the logical-level activities with their respective physical-level operators. Transitions are of two kinds: (a) replacement of a physical implementation and (b) introduction of sorter activities, which apply on stored recordsets and sort their tuples according to the values of some attributes that are critical for the sorting. The main idea behind the introduction of sorters is that order-aware implementations can be much faster than their order-neutral equivalents and possibly outweigh the cost imposed by the sorting of data. As with classical query optimization, existing ordering in the storage of incoming data or the reuse of interesting orders in more than one operator can prove beneficial. Finally, this is the first paper where butterflies have been used as an experimental tool for the assessment of the proposed method (see the coming section on benchmarking for more details).

The Resumption Problem

An ETL process typically processes several MBs or GBs of data. Due to the complex nature of the process and the sheer amount of data, failures create a significant problem for warehouse administrators – mainly due to the time limits (typically referred to as the time window) within which the loading process must be completed. Therefore, the efficient resumption of the ETL process in the case of failures is very important.

Labio, Wiener, Garcia-Molina & Gorelik (2000) are concerned with the issue of efficiently resuming an interrupted workflow. Instead of redoing the workflow all over from the beginning, the authors propose a resumption algorithm, called DR, based on the fundamental observation that whenever an activity outputs data in an ordered fashion, its resumption can start right where it was interrupted. Activities are practically treated as black boxes (where only the input and output are of interest) and a tree of activities is used to model the workflow. Each path of the tree is characterized on the relationship of output to input tuples and on the possibility of ignoring some tuples. Each transformation in an ETL process is characterized with respect to a set of properties that concern (a) the extent to which an input tuple produces more than one output tuple (if not, this can be exploited at resumption time), (b) the extent to which a prefix or a suffix of a transformation's output can be produced by a prefix or a suffix of the input (in which case, resumption can start from the last tuple under process at the time of failure), and (c) the order produced by the transformation (independently of the input's order) and the deterministic nature of the transformation. Combinations of these properties are also considered by the authors. Moreover, these properties are not defined only for transformations in isolation, but are also generalized for sequences of transformations, i.e., the whole ETL process.

The resumption algorithm has two phases: (a) design, where the activities of the workflow are characterized with respect to the aforementioned properties, and (b) resumption, which is based on the previous characterization and is invoked in the event of failure.

• Design constructs a workflow customized to execute the resumption of the original workflow. To this end, re-extraction procedures are assigned to extractors; these procedures regulate whether all or part of the input will be extracted from the sources. These re-extraction procedures are complemented with filters that are responsible for blocking source or intermediate tuples that have already been processed – and, most importantly, stored – during the regular operation of the ETL process.
• Resume consists of the assignment of the correct values to the resumption workflow's filters, so that the tuples that are already stored in the warehouse are blocked. Then, Resume performs the application of the re-extraction procedures. Subsequently, the load of the warehouse continues as in regular operation.
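The intuition of the resumption phase can be illustrated with the following simplified sketch (Python), which assumes that the target is loaded in increasing order of a key and that the maximum key stored before the failure is known; it is not the DR algorithm itself, and all names are invented.

    # Sketch: on resumption, a filter in front of the loader discards tuples whose
    # key is <= the maximum key already stored in the warehouse before the crash,
    # so re-extracted input is not loaded twice.
    def resume_load(re_extracted_rows, warehouse_rows, key=lambda r: r[0]):
        already_stored = max((key(r) for r in warehouse_rows), default=None)
        for row in re_extracted_rows:
            if already_stored is not None and key(row) <= already_stored:
                continue                      # blocked by the resumption filter
            warehouse_rows.append(row)        # regular loading continues from here
        return warehouse_rows

    warehouse = [(1, "a"), (2, "b")]          # rows committed before the failure
    source    = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
    print(resume_load(source, warehouse))     # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]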


Benchmarking

Unfortunately, the area of ETL processes suffers from the absence of a thorough benchmark that puts tools and algorithms to the test, taking into consideration parameters like the complexity of the process, the data volume, the amount of necessary cleaning and the computational cost of individual activities of the ETL workflows.

The Transaction Processing Council has proposed two benchmarks for the area of decision support. The TPC-H benchmark (TPC-H, 2008) is a decision support benchmark that consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The database describes a sales system, keeping information for the parts and the suppliers, and data about orders and the supplier's customers. The relational schema of TPC-H is not a typical warehouse star or snowflake schema; on the contrary, apart from a set of tables that are clearly classified as dimension tables, the facts are organized in a combination of tables acting as bridges and fact tables. Concerning the ETL process, TPC-H requires only the existence of very simple insertion and deletion SQL statements that directly modify the contents of the LINEITEM and ORDERS warehouse tables. Clearly, TPC-H is not related to ETL, since there is no workflow of cleanings or transformations, no value computations and no routing of data from the sources to the appropriate targets in the warehouse.

TPC-DS is a new Decision Support (DS) workload being developed by the TPC (TPC-DS, 2005). This benchmark models the decision support system of a retail product supplier, including queries and data maintenance. The relational schema of this benchmark is more complex than the schema presented in TPC-H. TPC-DS involves six star schemata (with a large overlap of shared dimensions) standing for the Sales and Returns of items purchased via three sales channels: a Store, a Catalog and the Web channel. The structure of the schemata is more natural for data warehousing than TPC-H; still, the schemata are neither pure stars nor pure snowflakes. The dimensions follow a snowflake pattern, with a different table for each level; nevertheless, the fact table has foreign keys to all the dimension tables of interest (resulting in fast joins with the appropriate dimension level whenever necessary). TPC-DS provides a significantly more sophisticated palette of refreshment operations for the data warehouse than TPC-H. There is a variety of maintenance processes that insert or delete facts, maintain inventories and refresh dimension records, either in a history-keeping or in a non-history-keeping method. To capture the semantics of the refreshment functions, warehouse tables are pseudo-defined as views over the sources. The refreshment scenarios of TPC-DS require the usage of functions for transformations and computations (with date transformations and surrogate key assignments being very popular). Fact tables are also populated via a large number of inner and outer joins to dimension tables. Overall, TPC-DS is a significant improvement over TPC-H in terms of benchmarking the ETL process; nevertheless, it still lacks the notion of large workflows of activities with schema and value transformations, row routing and other typical ETL features.


A first academic effort for the benchmarking of warehouses is found in Darmont, Bentayeb, & Boussaïd (2005). Still, the benchmark explicitly mentions ETL processes as future work and does not address the abovementioned problems. A second approach towards a benchmark for ETL processes is presented in Vassiliadis, Karagiannis, Tziovara & Simitsis (2007). The authors provide a rough classification of template structures, which are called butterflies, due to their shape. In the general case, a butterfly comprises three elements. First, a butterfly comprises a left wing, where source data are successively combined (typically via join operations) towards a central point of storage (called the body of the butterfly). Finally, the right wing comprises the routing and aggregation of the tuples that arrived at the body towards materialized views (also simulating reports, spreadsheets, etc.). Depending on the shape of the workflow (where one of the wings might be missing), the authors propose several template workflow structures and a concrete set of scenarios that are materializations of these templates.

Near-Real-Time ETL

Traditionally, the data warehouse refreshment process has been executed in an off-line mode, during a nightly time window. Nowadays, business needs for on-line monitoring of source-side business processes drive the request for 100% data freshness at the users' reports at all times. Absolute freshness (or real time warehousing, as it is abusively called) practically contradicts one of the fundamental reasons for separating the OLTP systems from reporting tasks, namely load, lock contention and efficiency (at least when large scale is concerned). Still, due to the user requests, data warehouses cannot escape their transformation to data providers for the end users with an increasing rate of incoming, fresh data. For all these reasons, a compromise is necessary and, as a result, the refreshment process moves to periodic refresh operations with a period of hours or even minutes (instead of days). This brings near real time data warehousing to the stage. In this subsection, we will review some research efforts towards this direction.

Karakasidis, Vassiliadis & Pitoura (2005) propose a framework for the implementation of a near real time warehouse (called "active" data warehousing by the authors). Several goals are taken into consideration by the authors, and specifically: (a) minimal changes in the software configuration of the source, (b) minimal overhead on the source due to the continuity of data propagation, and (c) the possibility of smoothly regulating the overall configuration of the environment in a principled way. The architecture of the system is based on pipelining: each activity behaves as a queue (and thus, it is called an ETL queue) that periodically checks the contents of its queue buffers and passes a block of tuples to a subsequent queue once they are appropriately processed (i.e., filtered or transformed). The queues of an ETL workflow form a queue network and pipelining takes place. The paper explores a few other possibilities: (a) queueing theory is used for the prediction of the behavior of the queue network, (b) a legacy application over ISAM files was modified with minimal software changes and (c) web services were employed at the warehouse end to accept the final blocks of tuples and load them to the warehouse. The latter performed reasonably well, although the lack of lightweightness in web service architectures poses an interesting research challenge.
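A toy rendering of the ETL queue idea is given below (Python); the class, the block size and the per-tuple functions are assumptions of this sketch and do not reflect the actual framework.

    from collections import deque

    class ETLQueue:
        """Toy ETL queue: buffers incoming tuples, applies a per-tuple function
        and forwards a processed block to the next stage when the block is full."""
        def __init__(self, transform, successor=None, block_size=3):
            self.buffer, self.transform = deque(), transform
            self.successor, self.block_size = successor, block_size

        def push(self, tup):
            self.buffer.append(tup)
            if len(self.buffer) >= self.block_size:
                self.flush()

        def flush(self):
            block = [x for x in (self.transform(t) for t in self.buffer) if x is not None]
            self.buffer.clear()
            if self.successor is not None:
                for t in block:
                    self.successor.push(t)
            elif block:
                print("loaded:", block)

    # a two-stage pipeline: a cleaning queue feeding a loading queue
    loader = ETLQueue(lambda t: t, block_size=2)
    cleaner = ETLQueue(lambda t: t if t["price"] >= 0 else None, successor=loader)
    for i, p in enumerate([10.0, -1.0, 5.0]):
        cleaner.push({"id": i, "price": p})
    cleaner.flush()   # drain whatever is still buffered
    loader.flush()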

The work by Luo, Naughton, Ellmann & Watzke (2006) deals with the problem of continuous maintenance of materialized views. The continuous loading of the warehouse with new or updated source data is typically performed via concurrent sessions. In this case, the existence of materialized views that are defined as joins of the source tables may cause deadlocks (practically, the term 'source tables' should be understood as referring to cleansed, integrated "replicas" of the sources within the warehouse). This is due to the fact that the maintenance of the view due to an update of source R1 may require a lookup to source relation R2 for the match of the new delta. If R2 is updated concurrently with R1, then a deadlock may occur. To avoid deadlocks, Luo et al. propose the reordering of transactions in the warehouse. The authors assume that the refreshment process operates over join or aggregate-join materialized views. The authors suggest several rules for avoiding deadlocks. Specifically, the authors require that (a) at any time, only one of the relations of a join view is updated, (b) data coming from the sources are not randomly assigned to update sessions, but rather, all the modifications of a tuple are routed to the same refreshment session, and (c) the traditional 2PL, tuple-level locking mechanism is replaced by some higher-level protocol. Individual modifications to the same relation are grouped in transactions and a scheduling protocol for these transactions is proposed, along with some starvation avoidance heuristics. A second improvement that Luo et al. propose is the usage of "pre-aggregation" for aggregate materialized views, which practically involves the grouping of all the modifications of a certain view tuple in the same transaction to one modification with their net effect. This can easily be achieved by sorting the individual updates over their target tuple in the aggregate materialized view. The experimental assessment indicates that the throughput is significantly increased as an effect of the reordering of transactions – especially as the number of modifications per transaction increases.

Thiele, Fischer & Lehner (2007) discuss the problem of managing the workload of the warehouse in a near real time warehouse (called a real time warehouse by the authors). The authors discuss a load-balancing mechanism that schedules the execution of query and update transactions according to the preferences of the users. There are two fundamental, conflicting goals that the scheduling tries to reconcile. On the one hand, users want efficient processing of their queries and low response time. This goal is referred to as Quality of Service by the authors. On the other hand, the users also want the warehouse data to be as up-to-date as possible with respect to their originating records at the sources. This goal is referred to as Quality of Data by the authors. Two queues are maintained by the algorithm, one for the queries and one for the update transactions. Since a reconciliation must be made between two conflicting requirements, the authors assume that queries are tagged with two scores, one for the requirement for freshness and another for the requirement of query efficiency. To enable the near real time loading of the warehouse, Thiele et al. propose a two-level scheduling mechanism. The first level of scheduling is dedicated to deciding whether a user query or an update transaction will be executed; this decision is based on the sum of the scores of both requirements over all the queries. The winning sum determines whether an update or a query will be executed. The second level of scheduling resolves which transaction will be executed. To avoid implications for data correctness and to serve both goals better, the authors resort to two scheduling guidelines. If an update transaction is to be executed, then it should be the one related to the data that are going to be queried by a query at the beginning of the query queue. On the other hand, if a query is to be executed (i.e., efficiency has won the first level of the scheduling contest), then the query with a higher need for Quality of Service is picked. To avoid starvation of queries with low preference for quality of service, the QoS tags of all queries that remain in the queue are increased after each execution of a query.
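The two-level decision can be pictured roughly as follows (Python); the scoring and the queue handling are deliberately simplistic stand-ins for the policies of Thiele et al., and every structure and name below is an assumption of this sketch.

    # Sketch of a two-level scheduler: level 1 compares the summed freshness (QoD)
    # and efficiency (QoS) scores of all pending queries to decide whether an
    # update or a query runs next; level 2 picks the concrete transaction.
    def schedule(queries, updates):
        """queries: dicts with 'qos', 'qod' and 'needs' (a table name);
           updates: dicts with 'table'. Returns ('update', u) or ('query', q)."""
        if not queries:
            return ("update", updates.pop(0)) if updates else None
        if updates and sum(q["qod"] for q in queries) > sum(q["qos"] for q in queries):
            # freshness wins: run the update feeding the query at the head of the queue
            needed = queries[0]["needs"]
            idx = next((i for i, u in enumerate(updates) if u["table"] == needed), 0)
            return "update", updates.pop(idx)
        # efficiency wins: run the query with the highest QoS need, age the others
        best = max(queries, key=lambda q: q["qos"])
        queries.remove(best)
        for q in queries:
            q["qos"] += 1          # crude starvation avoidance
        return "query", best

    queries = [{"qos": 2, "qod": 5, "needs": "SALES"}, {"qos": 4, "qod": 3, "needs": "STOCK"}]
    updates = [{"table": "STOCK"}, {"table": "SALES"}]
    print(schedule(queries, updates))   # ('update', {'table': 'SALES'})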


Thomsen, Pedersen & Lehner (2008) discuss a loader for near real time, or right time, data warehouses. The loader tries to synchronize the loading of the warehouse with queries that require source data with a specific freshness guarantee. The idea of loading the data when they are needed (as opposed to before they are needed) produces the notion of right time warehousing. The architecture of the middleware discussed by the authors involves three tiers. The first tier concerns the data producer, at the source side. The middleware module provided for the producer captures the modification operations "insert" and "commit" for JDBC statements. The user at the source side can decide whether committed data are to become available for the warehouse (in this case, this is referred to as materialization by the authors). On insert, the new source values are cached in a source buffer and wait to be propagated to the next tier, which is called the catalyst. Apart from that, the source middleware provides callbacks for regulating its policy as steal or no-steal with respect to the flushing of committed data to the hard disk at the source side, as well as a third option for regulating its policy on the basis of its current load or elapsed idle time. The catalyst side, at the same time, also buffers data for the consumer. An interesting property of the catalyst is the guarantee of freshness for the user. So, whenever a warehouse user requests data, he can set a time interval for the data he must have; then the catalyst checks its own buffers and communicates with the source-side middleware to compile the necessary data. A precise synchronization of the data that must be shed from main memory, via appropriate main memory indexing, is also discussed by the authors. The third tier, at the warehouse side, operates in a REPEATABLE READ isolation mode and appropriately locks records (via a shared lock) so that the catalyst does not shed them during their loading to the warehouse. Finally, the authors discuss a simple programmatic facility, via appropriate view definitions, to allow the warehouse user to retrieve the freshest data possible in a transparent way.

Polyzotis, Skiadopoulos, Vassiliadis, Simitsis & Frantzell (2007, 2008) deal with an individual operator of the near real time ETL process, namely the join of a continuous stream of updates generated at the sources with a large, disk-resident relation at the warehouse side, under the assumption of limited memory. Such a join can be used on several occasions, such as surrogate key assignment, duplicate detection, simple data transformations, etc. To this end, a specialized join algorithm, called MeshJoin, is introduced. The main idea is that the relation is continuously brought to main memory in scans of sequential blocks that are joined with the buffered stream tuples. A precise expiration mechanism for the stream tuples guarantees the correctness of the result. The authors propose an analytic cost model that relates the stream rate and the memory budget. This way, the administrator can tune the operation of the algorithm, either in order to maximize throughput for a given memory budget, or in order to minimize the necessary memory for a given stream rate. In the case of thrashing, an approximate version of the algorithm with load-shedding strategies that minimize the loss of output tuples is discussed. MeshJoin makes no assumption on the order, indexing, join condition and join relationship of the joined stream and relation; at the same time, it relates the stream rate with the available memory budget and gives correctness guarantees for an exact result if this is possible, or allows a lightweight approximate result otherwise.
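The core loop of such a stream-relation join can be sketched as follows (Python): the relation is cycled through in blocks and every buffered stream tuple is probed against each block exactly once before it expires. This is a drastic simplification with invented names, not the MeshJoin implementation.

    # Sketch only: assumes a non-empty relation; one stream tuple is admitted per step.
    def mesh_join(stream, relation, key, block_size=2):
        blocks = [relation[i:i + block_size] for i in range(0, len(relation), block_size)]
        window = []                                  # entries: [stream_tuple, blocks_seen]
        results = []
        b = 0
        while stream or window:
            if stream:
                window.append([stream.pop(0), 0])
            block = blocks[b]
            for entry in window:
                tup, _ = entry
                results += [(tup, r) for r in block if key(tup) == key(r)]
                entry[1] += 1
            window = [e for e in window if e[1] < len(blocks)]   # expire finished tuples
            b = (b + 1) % len(blocks)
        return results

    relation = [(1, "part-1"), (2, "part-2"), (3, "part-3"), (4, "part-4")]
    stream   = [(2, "update-a"), (4, "update-b")]
    print(mesh_join(stream, relation, key=lambda t: t[0]))
    # [((2, 'update-a'), (2, 'part-2')), ((4, 'update-b'), (4, 'part-4'))]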


SYSTEMS

Industrial systems for ETL are provided both by the major database vendors and by individual ETL-targeted vendors. Popular tools include Oracle Warehouse Builder (2008), IBM DataStage (2008), Microsoft Integration Services (2008) and Informatica PowerCenter (2008). There is an excellent survey by Barateiro & Galhardas (2005) that makes a thorough discussion and feature comparison of ETL tools of both academic and industrial origin. Friedman, Beyer & Bitterer (2007) as well as Friedman & Bitterer (2007) give two interesting surveys of the area from a marketing perspective. In this section, we will discuss only academic efforts related to ETL systems.

AJAX

The AJAX system (Galhardas, Florescu, Shasha & Simon, 2000) is a data cleaning tool developed at INRIA, France, that deals with typical data quality problems, such as duplicate identification, errors due to mistyping and data inconsistencies between matching records. This tool can be used either for a single source or for integrating multiple data sources. AJAX provides a framework wherein the logic of a data cleaning program is modeled as a directed graph of data transformations that start from some input source data. Four types of data transformations are supported:

• Mapping transformations standardize data formats (e.g., date format) or simply merge or split columns in order to produce more suitable formats.
• Matching transformations find pairs of records that most probably refer to the same object. These pairs are called matching pairs and each such pair is assigned a similarity value.
• Clustering transformations group together matching pairs with a high similarity value by applying a given grouping criterion (e.g., transitive closure).
• Merging transformations are applied to each individual cluster in order to eliminate duplicates or produce new records for the resulting integrated data source.

AJAX also provides a declarative language for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express mapping, matching, clustering and merging transformations. Finally, an interactive environment is supplied to the user in order to resolve errors and inconsistencies that cannot be automatically handled and to support a stepwise design of data cleaning programs. The linguistic aspects and the theoretic foundations of this tool can be found in Galhardas, Florescu, Shasha, Simon & Saita (2001), and Galhardas, Florescu, Shasha & Simon (1999), where, apart from the presentation of a general framework for the data cleaning process, specific optimization techniques tailored for data cleaning applications are discussed.
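A toy rendering of the matching-clustering-merging part of such a pipeline is shown below (Python, with an arbitrary string similarity, an invented threshold and transitive closure as the grouping criterion); it is only meant to illustrate the flavor of these transformations, not AJAX itself.

    from difflib import SequenceMatcher

    def matching(records, threshold=0.8):
        # find pairs of records that most probably refer to the same object
        pairs = []
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                sim = SequenceMatcher(None, records[i]["name"], records[j]["name"]).ratio()
                if sim >= threshold:
                    pairs.append((i, j, sim))
        return pairs

    def clustering(n, pairs):
        # group matching pairs by transitive closure (union-find)
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i, j, _ in pairs:
            parent[find(i)] = find(j)
        clusters = {}
        for i in range(n):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())

    def merging(records, clusters):
        # collapse each cluster into a single representative record
        return [{"name": records[c[0]]["name"],
                 "sources": [records[i]["src"] for i in c]} for c in clusters]

    records = [{"name": "John Smith", "src": "A"},
               {"name": "Jon Smith",  "src": "B"},
               {"name": "Mary Jones", "src": "A"}]
    print(merging(records, clustering(len(records), matching(records))))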


ARKTOS

Arktos (Vassiliadis et al., Information Systems 2005) is an ETL tool that has been prototypically implemented with the goal of facilitating the design, the (re-)use, and the optimization of ETL workflows. Arktos is based on the metamodel of Vassiliadis et al. (Information Systems 2005) for representing ETL activities and ETL workflows, and its two key features are (a) the extensibility mechanisms for reusing transformations and (b) the close linkage to formal semantics and their representation. Arktos is accompanied by an ETL library that contains template code of built-in functions and maintains template code of user-defined functions. Template activities are registered in the system and they can be reused for the specification of a scenario (either graphically, or via forms and declarative languages). The customization process results in producing an ETL scenario which is a DAG of ETL activities, each specified as a parameterized software module, having instantiated input and output schemata, concrete parameters, and a special-purpose schema for problematic records. The set of templates is extensible, to allow users to register their own frequently used transformations.

Arktos also offers zoom-in/zoom-out capabilities. The designer can deal with a scenario at two levels of granularity: (a) at the entity, or zoom-out, level, where only the participating recordsets and activities are visible and their provider relationships are abstracted as edges between the respective entities, or (b) at the attribute, or zoom-in, level, where the user can see and manipulate the constituent parts of an activity, along with their respective providers at the attribute level.

Finally, it is noteworthy that the model of Vassiliadis et al. (Information Systems 2005) comes with a mechanism for expressing the semantics of activities in LDL. The expression of semantics can be done both at the template and at the instance level, and a macro language is discussed in the paper for this purpose. The approach is based on the fundamental observation that the LDL description can be mapped to a useful graph representation of the internals of an activity, which allows both the visualization and the measurement of interesting properties of the graph.

HumMer - Fusion

HumMer (Naumann, Bilke, Bleiholder & Weis, 2006) is a tool developed at the Hasso-Plattner Institute in Potsdam that deals with the problem of data fusion. Data fusion is the task of identifying and resolving different representations of the same object of the real world. HumMer splits the process in three steps: (a) schema alignment, (b) duplicate detection and (c) the core of the data fusion process. Schema alignment deals with the problem of schema inconsistencies. Assuming the user is in possession of different data sets representing the same entities of the real world, HumMer is equipped with a dedicated module that detects a small sample of duplicates in these data sets and tries to find schema similarities. Duplicate detection is performed via comparing tuples over a similarity threshold. Once inconsistencies at the schema and tuple level have been resolved, it is time for the resolution of value inconsistencies, which is referred to as data fusion by Naumann et al (2006). FuSem (Bleiholder, Draba & Naumann, 2007) is an extension of HumMer with the purpose of interactively facilitating the data fusion process. Assuming that the user has different data sets representing the same real world entities, FuSem allows the SQL querying of these data sources, their combination through outer join and union operations and the application of different tuple matching operators. The extension of SQL with the FUSE BY operator, proposed by Bleiholder & Naumann (2005), is also part of FuSem. FUSE BY allows both the alignment of different relations in terms of schemata and the identification of tuples representing the same real world object. An important feature of FuSem is the visualization of results, which is primarily based on different representations of Venn diagrams for the involved records and the interactive exploration of areas where two data sets overlap or differ.

Potter's Wheel

Raman & Hellerstein (2000, 2001) present the Potter's Wheel system, which is targeted to provide interactive data cleaning to its users. The system offers the possibility of performing several algebraic operations over an underlying data set, including format (application of a function), drop, copy, add a column, merge delimited columns, split a column on the basis of a regular expression or a position in a string, divide a column on the basis of a predicate (resulting in two columns, the first involving the rows satisfying the condition of the predicate and the second involving the rest), selection of rows on the basis of a condition, folding columns (where a set of attributes of a record is split into several rows) and unfolding. Optimization algorithms are also provided for the CPU usage of certain classes of operators. The general idea behind Potter's Wheel is that users build data transformations in an iterative and interactive way. Specifically, users gradually build transformations to clean the data by adding or undoing transformations on a spreadsheet-like interface; the effect of a transformation is shown at once on the records visible on screen. These transformations are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus, users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
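The spirit of this stepwise, undoable style of transformation construction is conveyed by the following sketch (Python); the operations shown and their names are simplified inventions and not the Potter's Wheel algebra.

    # Sketch: a tiny "spreadsheet" of rows on which transformations are applied one
    # at a time and can be undone; only split-on-delimiter and drop are shown.
    class Wheel:
        def __init__(self, rows):
            self.history = [rows]

        @property
        def rows(self):
            return self.history[-1]

        def apply(self, op):
            self.history.append([op(r) for r in self.rows])
            return self.rows                       # immediate feedback on screen

        def undo(self):
            if len(self.history) > 1:
                self.history.pop()
            return self.rows

    def split_col(idx, sep):
        return lambda row: row[:idx] + tuple(row[idx].split(sep, 1)) + row[idx + 1:]

    def drop_col(idx):
        return lambda row: row[:idx] + row[idx + 1:]

    w = Wheel([("Smith, John", "NY"), ("Jones, Mary", "CA")])
    print(w.apply(split_col(0, ", ")))   # [('Smith', 'John', 'NY'), ('Jones', 'Mary', 'CA')]
    print(w.apply(drop_col(2)))          # [('Smith', 'John'), ('Jones', 'Mary')]
    print(w.undo())                      # back to the three-column view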


CONCLUDING REMARKS

This survey has presented the research work in the field of Extraction-Transformation-Loading (ETL) processes and tools. The main research goals around which research has been organized so far can be summarized as follows.

a. The first goal concerns the construction of commonly accepted conceptual and logical modeling tools for ETL processes, with a view to a standardized approach.
b. The second goal concerns the efficiency of individual ETL operators. To structure the discussion better, we have organized the discussion around the three main parts of the E-T-L triplet, and examined problems that fall within each of these stages. So far, research has come up with interesting solutions for the detection of differences in two snapshots of data, specific data transformations, duplicate detection mechanisms and the practical and theoretical investigation of the lineage of warehouse data.
c. A third research goal has been to devise algorithms for the efficient operation of the entire ETL process. Specifically, research has some first results for the optimization and resumption of entire ETL processes as well as some first investigations towards (near) real time ETL.
d. A fourth goal of the academic world has been the construction of tools for the facilitation of the ETL process.

Apparently, the field is quite new, and there is still too much to be done. In the sequel, we present a personal viewpoint on the possible advancements that can be made in the field. Starting with the traditional ETL setting, there are quite a few research opportunities. Individual operators are still far from being closed as research problems. The extraction of data still remains a hard problem, mostly due to the closed nature of the sources. The loading problem is still quite unexplored with respect to its practical aspects. The optimization and resumption problems, along with the appropriate cost models, are far from maturity. The absence of a benchmark is hindering future research (which lacks a commonly agreed way to perform experiments). The introduction of design patterns for warehouse schemata that take ETL into consideration is also open research ground.

At the same time, the presence of new, heavily parallelized processors and the near certainty of disruptive effects in hardware and disk technology in the immediate future put all the database issues of efficiency back on the table. ETL cannot escape the rule, and both individual transformations as well as the whole process can be reconsidered in the presence of these improvements. Most importantly, all these research opportunities should be viewed via the looking glass of near real time ETL, with the need for completeness and freshness of data pressing from the part of the users.

As an overall conclusion, we believe that the design, algorithmic and theoretical results in the field of ETL processes are open to exploration, both due to the present problems and on the basis of the changing environment of computer science, databases and user needs.

REFERENCES

Barateiro, J., & Galhardas, H. (2005). A Survey of Data Quality Tools. Datenbank-Spektrum, 14, 15-21.

Bleiholder, J., & Naumann, F. (2005). Declarative Data Fusion - Syntax, Semantics, and Implementation. 9th East European Conference on Advances in Databases and Information Systems (ADBIS 2005), pp. 58-73, Tallinn, Estonia, September 12-15, 2005.

Bleiholder, J., Draba, K., & Naumann, F. (2007). FuSem - Exploring Different Semantics of Data Fusion. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB 2007), pp. 1350-1353, University of Vienna, Austria, September 23-27, 2007.

Carreira, P., Galhardas, H., Pereira, J., Martins, F., & Silva, M. (2007). On the performance of one-to-many data transformations. Proceedings of the Fifth International Workshop on Quality in Databases (QDB 2007), pp. 39-48, in conjunction with the VLDB 2007 conference, Vienna, Austria, September 23, 2007.

Carreira, P., Galhardas, H., Lopes, A., & Pereira, J. (2007). One-to-many data transformations through data mappers. Data & Knowledge Engineering, 62, 3, 483-503.

Carreira, P., Galhardas, H., Pereira, J., & Lopes, A. (2005). Data Mapper: An Operator for Expressing One-to-Many Data Transformations. 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2005), pp. 136-145, Copenhagen, Denmark, August 22-26, 2005.


Carreira, P., Galhardas, H., Pereira, J., & Wichert, A. (2007). One-to-many data transformation operations - optimization and execution on an RDBMS. Proceedings of the Ninth International Conference on Enterprise Information Systems, Volume DISI, ICEIS (1) 2007, pp. 21-27, Funchal, Madeira, Portugal, June 12-16, 2007.

Colby, L., Kawaguchi, A., Lieuwen, D., Mumick, I., & Ross, K. (1997). Supporting Multiple View Maintenance Policies. In Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), pp. 405-416, May 13-15, 1997, Tucson, Arizona, USA.

Cui, Y., & Widom, J. (2001). Lineage Tracing for General Data Warehouse Transformations. Proceedings of 27th International Conference on Very Large Data Bases (VLDB 2001), pp. 471-480, September 11-14, 2001, Roma, Italy.

Cui, Y., & Widom, J. (2003). Lineage tracing for general data warehouse transformations. VLDB Journal, 12, 1, 41-58.

Cunningham, C., Galindo-Legaria, C., & Graefe, G. (2004). PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB 2004), pp. 998-1009, Toronto, Canada, August 31 - September 3, 2004.

Darmont, J., Bentayeb, F., & Boussaïd, O. (2005). DWEB: A Data Warehouse Engineering Benchmark. Proceedings 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2005), pp. 85-94, Copenhagen, Denmark, August 22-26, 2005.

Davidson, S., & Kosky, A. (1999). Specifying Database Transformations in WOL. Bulletin of the Technical Committee on Data Engineering, 22, 1, 25-30.

Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19, 1, 1-16.

Fagin, R. (2007). Inverting schema mappings. ACM Transactions on Database Systems, 32, 4, 25:1-25:53.

Fagin, R., Kolaitis, P., & Popa, L. (2005). Data exchange: getting to the core. ACM Transactions on Database Systems, 30, 1, 174-210.

Fagin, R., Kolaitis, P., Miller, R., & Popa, L. (2005). Data exchange: semantics and query answering. Theoretical Computer Science, 336, 1, 89-124.

Fenk, R., Kawakami, A., Markl, V., Bayer, R., & Osaki, S. (2000). Bulk Loading a Data Warehouse Built Upon a UB-Tree. Proceedings 2000 International Database Engineering and Applications Symposium (IDEAS 2000), pp. 179-187, September 18-20, 2000, Yokohama, Japan.

Friedman, T., & Bitterer, A. (2007). Magic Quadrant for Data Quality Tools, 2007. Gartner RAS Core Research Note G00149359, 29 June 2007. Available at http://mediaproducts.gartner.com/reprints/businessobjects/149359.html (last accessed 23 July 2008).

Friedman, T., Beyer, M., & Bitterer, A. (2007). Magic Quadrant for Data Integration Tools, 2007. Gartner RAS Core Research Note G00151150, 5 October 2007. Available at http://mediaproducts.gartner.com/reprints/oracle/151150.html (last accessed 23 July 2008).

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (1999). An Extensible Framework for Data Cleaning. Technical Report INRIA 1999 (RR-3742).

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). Ajax: An Extensible Data Cleaning Tool. In Proc. ACM SIGMOD International Conference on the Management of Data, pp. 590, Dallas, Texas.

Galhardas, H., Florescu, D., Shasha, D., Simon, E., & Saita, C. (2001). Declarative Data Cleaning: Language, Model, and Algorithms. Proceedings of 27th International Conference on Very Large Data Bases (VLDB 2001), pp. 371-380, September 11-14, 2001, Roma, Italy.

Gupta, A., & Mumick, I. (1995). Maintenance of Materialized Views: Problems, Techniques, and Applications. IEEE Data Engineering Bulletin, 18, 2, 3-18.

Gupta, H., & Mumick, I. (2006). Incremental maintenance of aggregate and outerjoin expressions. Information Systems, 31, 6, 435-464.

IBM. WebSphere DataStage. Product's web page at http://www-306.ibm.com/software/data/integration/datastage/ (last accessed 21 July 2008).

Informatica. PowerCenter. Product's web page at http://www.informatica.com/products_services/powercenter/Pages/index.aspx (last accessed 21 July 2008).


Karakasidis, A., Vassiliadis, P., & Pitoura, E. (2005). ETL Queues for Active Data Warehousing. In Proc. 2nd International Workshop on Information Quality in Information Systems (IQIS 2005), co-located with ACM SIGMOD/PODS 2005, June 17, 2005, Baltimore, MD, USA.

Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley & Sons.

Kimball, R., & Caserta, J. (2004). The Data Warehouse ETL Toolkit. Wiley Publishing, Inc.

Kotidis, Y. (2002). Aggregate View Management in Data Warehouses. In Handbook of Massive Data Sets (pages 711-742). Kluwer Academic Publishers, ISBN 1-4020-0489-3.

Koudas, N., & Srivastava, D. (2005). Approximate Joins: Concepts and Techniques. Tutorial at the 31st International Conference on Very Large Data Bases (VLDB 2005), pp. 1363, Trondheim, Norway, August 30 - September 2, 2005. Slides available at http://queens.db.toronto.edu/~koudas/docs/AJ.pdf (last accessed 23 July 2008).

Koudas, N., Sarawagi, S., & Srivastava, D. (2006). Record linkage: similarity measures and algorithms. Tutorial at the ACM SIGMOD International Conference on Management of Data (SIGMOD 2006), pp. 802-803, Chicago, Illinois, USA, June 27-29, 2006. Slides available at http://queens.db.toronto.edu/~koudas/docs/aj.pdf (last accessed 23 July 2008).

Labio, W., & Garcia-Molina, H. (1996). Efficient Snapshot Differential Algorithms for Data Warehousing. In Proceedings of 22nd International Conference on Very Large Data Bases (VLDB 1996), pp. 63-74, September 3-6, 1996, Mumbai (Bombay), India.

Labio, W., Wiener, J., Garcia-Molina, H., & Gorelik, V. (2000). Efficient Resumption of Interrupted Warehouse Loads. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD 2000), pp. 46-57, Dallas, Texas, USA.

Labio, W., Yerneni, R., & Garcia-Molina, H. (1999). Shrinking the Warehouse Update Window. Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), pp. 383-394, June 1-3, 1999, Philadelphia, Pennsylvania, USA.

Lindsay, B., Haas, L., Mohan, C., Pirahesh, H., & Wilms, P. (1986). A Snapshot Differential Refresh Algorithm. Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data (SIGMOD 1986), pp. 53-60, Washington, D.C., May 28-30, 1986.

Luján-Mora, S., Vassiliadis, P., & Trujillo, J. (2004). Data Mapping Diagrams for Data Warehouse Design with UML. In Proc. 23rd International Conference on Conceptual Modeling (ER 2004), pp. 191-204, Shanghai, China, 8-12 November 2004.

Luo, G., Naughton, J., Ellmann, C., & Watzke, M. (2006). Transaction Reordering and Grouping for Continuous Data Loading. First International Workshop on Business Intelligence for the Real-Time Enterprises (BIRTE 2006), pp. 34-49, Seoul, Korea, September 11, 2006.

Microsoft. SQL Server Integration Services. Product's web page at http://www.microsoft.com/sql/technologies/integration/default.mspx (last accessed 21 July 2008).

Mumick, I., Quass, D., & Mumick, B. (1997). Maintenance of Data Cubes and Summary Tables in a Warehouse. In Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), pp. 100-111, May 13-15, 1997, Tucson, Arizona, USA.

Naumann, F., Bilke, A., Bleiholder, J., & Weis, M. (2006). Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies. IEEE Data Engineering Bulletin, 29, 2, 21-31.

Oracle. Oracle Warehouse Builder. Product's web page at http://www.oracle.com/technology/products/warehouse/index.html (last accessed 21 July 2008).

Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Simitsis, A., & Frantzell, N. (2007). Supporting Streaming Updates in an Active Data Warehouse. In Proc. 23rd International Conference on Data Engineering (ICDE 2007), pp. 476-485, Constantinople, Turkey, April 16-20, 2007.

Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Simitsis, A., & Frantzell, N. (2008). Meshing Streaming Updates with Persistent Data in an Active Data Warehouse. IEEE Transactions on Knowledge and Data Engineering, 20, 7, 976-991.


Raman, V., & Hellerstein, J. (2000). Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. Technical Report, University of California at Berkeley, Computer Science Division, 2000. Available at http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf

Raman, V., & Hellerstein, J. (2001). Potter's Wheel: An Interactive Data Cleaning System. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB 2001), pp. 381-390, Roma, Italy.

Roth, M., & Schwarz, P. (1997). Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. In Proceedings of 23rd International Conference on Very Large Data Bases (VLDB 1997), pp. 266-275, August 25-29, 1997, Athens, Greece.

Roussopoulos, N. (1998). Materialized Views and Data Warehouses. SIGMOD Record, 27, 1, 21-26.

Roussopoulos, N., Kotidis, Y., & Roussopoulos, M. (1997). Cubetree: Organization of and Bulk Updates on the Data Cube. Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD 1997), pp. 89-99, May 13-15, 1997, Tucson, Arizona, USA.

Roussopoulos, N., Kotidis, Y., & Sismanis, Y. (1999). The Active MultiSync Controller of the Cubetree Storage Organization. Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), pp. 582-583, June 1-3, 1999, Philadelphia, Pennsylvania, USA.

Shu, N., Housel, B., Taylor, R., Ghosh, S., & Lum, V. (1977). EXPRESS: A Data EXtraction, Processing, and REStructuring System. ACM Transactions on Database Systems, 2, 2, 134-174.

Simitsis, A. (2005). Mapping conceptual to logical models for ETL processes. In Proceedings of ACM 8th International Workshop on Data Warehousing and OLAP (DOLAP 2005), pp. 67-76, Bremen, Germany, November 4-5, 2005.

Simitsis, A., & Vassiliadis, P. (2008). A Method for the Mapping of Conceptual Designs to Logical Blueprints for ETL Processes. Decision Support Systems, 45, 1, 22-40.

Simitsis, A., Vassiliadis, P., & Sellis, T. (2005). Optimizing ETL Processes in Data Warehouses. Proceedings 21st Int. Conference on Data Engineering (ICDE 2005), pp. 564-575, Tokyo, Japan, April 2005.

Simitsis, A., Vassiliadis, P., & Sellis, T. (2005). State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering, 17, 10, 1404-1419.

Simitsis, A., Vassiliadis, P., Terrovitis, M., & Skiadopoulos, S. (2005). Graph-Based Modeling of ETL activities with Multi-level Transformations and Updates. In Proc. 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2005), pp. 43-52, 22-26 August 2005, Copenhagen, Denmark.

Sismanis, Y., Deligiannakis, A., Roussopoulos, N., & Kotidis, Y. (2002). Dwarf: shrinking the PetaCube. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD 2002), pp. 464-475, Madison, Wisconsin, June 3-6, 2002.

Skoutas, D., & Simitsis, A. (2006). Designing ETL processes using semantic web technologies. In Proceedings ACM 9th International Workshop on Data Warehousing and OLAP (DOLAP 2006), pp. 67-74, Arlington, Virginia, USA, November 10, 2006.

Skoutas, D., & Simitsis, A. (2007). Flexible and Customizable NL Representation of Requirements for ETL processes. In Proceedings 12th International Conference on Applications of Natural Language to Information Systems (NLDB 2007), pp. 433-439, Paris, France, June 27-29, 2007.

Skoutas, D., & Simitsis, A. (2007). Ontology-Based Conceptual Design of ETL Processes for Both Structured and Semi-Structured Data. Int. Journal of Semantic Web and Information Systems (IJSWIS), 3, 4, 1-24.

Stanoi, I., Agrawal, D., & El Abbadi, A. (1999). Modeling and Maintaining Multi-View Data Warehouses. Proceedings 18th International Conference on Conceptual Modeling (ER 1999), pp. 161-175, Paris, France, November 15-18, 1999.

Stöhr, T., Müller, R., & Rahm, E. (1999). An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments. In Proc. Intl. Workshop on Design and Management of Data Warehouses (DMDW 1999), pp. 12.1-12.16, Heidelberg, Germany, 1999.

Thiele, M., Fischer, U., & Lehner, W. (2007). Partition-based workload scheduling in living data warehouse environments. In Proc. ACM 10th International Workshop on Data Warehousing and OLAP (DOLAP 2007), pp. 57-64, Lisbon, Portugal, November 9, 2007.


Thomsen, C., Pedersen, T., & Lehner, W. (2008). RiTE: Providing On-Demand Data for Right-Time Data Warehousing. Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), pp. 456-465, April 7-12, 2008, Cancun, Mexico.

TPC. The TPC-DS benchmark. Transaction Processing Council. Available at http://www.tpc.org/tpcds/default.asp (last accessed 24 July 2008).

TPC. The TPC-H benchmark. Transaction Processing Council. Available at http://www.tpc.org/ (last accessed 24 July 2008).

Trujillo, J., & Luján-Mora, S. (2003). A UML Based Approach for Modeling ETL Processes in Data Warehouses. In Proceedings of 22nd International Conference on Conceptual Modeling (ER 2003), pp. 307-320, Chicago, IL, USA, October 13-16, 2003.

Tziovara, V., Vassiliadis, P., & Simitsis, A. (2007). Deciding the Physical Implementation of ETL Workflows. Proceedings ACM 10th International Workshop on Data Warehousing and OLAP (DOLAP 2007), pp. 49-56, Lisbon, Portugal, 9 November 2007.

Vassiliadis, P., & Simitsis, A. (2009). Extraction-Transformation-Loading. In Encyclopedia of Database Systems, L. Liu, T.M. Özsu (eds), Springer, 2009.

Vassiliadis, P., Karagiannis, A., Tziovara, V., & Simitsis, A. (2007). Towards a Benchmark for ETL Workflows. 5th International Workshop on Quality in Databases (QDB 2007), held in conjunction with VLDB 2007, Vienna, Austria, 23 September 2007.

Vassiliadis, P., Quix, C., Vassiliou, Y., & Jarke, M. (2001). Data Warehouse Process Management. Information Systems, 26, 3, 205-236.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Modeling ETL activities as graphs. In Proc. 4th International Workshop on the Design and Management of Data Warehouses (DMDW 2002), held in conjunction with the 14th Conference on Advanced Information Systems Engineering (CAiSE'02), pp. 52-61, Toronto, Canada, May 27, 2002.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002). Conceptual Modeling for ETL Processes. In Proc. ACM 5th International Workshop on Data Warehousing and OLAP (DOLAP 2002), McLean, VA, USA, November 8, 2002.

Vassiliadis, P., Simitsis, A., Georgantas, P., & Terrovitis, M. (2003). A Framework for the Design of ETL Scenarios. In Proc. 15th Conference on Advanced Information Systems Engineering (CAiSE 2003), pp. 520-535, Klagenfurt/Velden, Austria, 16-20 June, 2003.

Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., & Skiadopoulos, S. (2005). A generic and customizable framework for the design of ETL scenarios. Information Systems, 30, 7, 492-525.

Vassiliadis, P., Simitsis, A., Terrovitis, M., & Skiadopoulos, S. (2005). Blueprints for ETL workflows. In Proc. 24th International Conference on Conceptual Modeling (ER 2005), pp. 385-400, 24-28 October 2005, Klagenfurt, Austria. Long version available at http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf (last accessed 25 July 2008).

Panos Vassiliadis received the PhD degree from the National Technical University of Athens in 2000. Since 2002, he has been with the Department of Computer Science, University of Ioannina, Greece, where he is also a member of the Distributed Management of Data (DMOD) Laboratory (http://www.dmod.cs.uoi.gr). His research activity and published work concerns the area of data warehousing, with particular emphasis on metadata, OLAP, and ETL, as well as the areas of database evolution and web services. He is a member of the ACM, the IEEE, and the IEEE Computer Society.
