An ETL Metadata Model for Data Warehousing
Journal of Computing and Information Technology - CIT 20, 2012, 2, 95–111
doi:10.2498/cit.1002046

Nayem Rahman1, Jessica Marz1 and Shameem Akhter2
1 Intel Corporation, USA
2 Western Oregon University, USA

Metadata is essential for understanding information stored in data warehouses. It helps increase levels of adoption and usage of data warehouse data by knowledge workers and decision makers. A metadata model is important to the implementation of a data warehouse; the lack of a metadata model can lead to quality concerns about the data warehouse. A highly successful data warehouse implementation depends on consistent metadata. This article proposes adoption of an ETL (extract-transform-load) metadata model for the data warehouse that makes subject area refreshes metadata-driven, loads observation timestamps and other useful parameters, and minimizes consumption of database systems resources. The ETL metadata model provides developers with a set of ETL development tools and delivers a user-friendly batch cycle refresh monitoring tool for the production support team.

Keywords: ETL metadata, metadata model, data warehouse, EDW, observation timestamp

1. Introduction

The data warehouse is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions [5]. A data warehouse is defined as a “subject-oriented, integrated, non-volatile and time variant collection of data in support of management’s decisions” [17]. It is considered a key platform for the integrated management of decision support data in organizations [31]. One of the primary goals in building data warehouses is to improve information quality in order to achieve certain business objectives such as competitive advantage or enhanced decision making capabilities [2, 3].

An enterprise data warehouse (EDW) gets data from different heterogeneous sources. Since the operational data sources and the target data warehouse reside in separate places, a continuous flow of data from source to target is critical to maintain data freshness in the data warehouse. Information about the data journey from source to target needs to be tracked in terms of load timestamps and other load parameters for the sake of data consistency and integrity. This information is captured in a metadata model.

Given the increased frequency of data warehouse refresh cycles, the increased importance of data warehouses in business organizations, and the increasing complexity of data warehouses, centralized management of metadata is essential for data warehouse administration, maintenance and usage [33]. From the standpoint of a data warehouse refresh process, metadata support is crucial to the data warehouse maintenance team, such as ETL developers, database administrators, and the production support team.

An efficient, flexible, robust, and state-of-the-art data warehousing architecture requires a number of technical advances [36]. A metadata model-driven cycle refresh is one such important advancement. Metadata is essential in data warehouse environments since it enables activities such as data integration, data transformation, on-line analytical processing (OLAP) and data mining [10]. Lately, in data warehouses, batch cycles run several times a day to load data from operational data sources to the data warehouse. A metadata model could be used for different purposes such as extract-transform-load, cycle runs, and cycle refresh monitoring.
Metadata has been identified as one of the key success factors of data warehousing projects [34]. It captures information about data warehouse data that is useful to business users and back-end support personnel. Metadata helps data warehouse end users to understand the various types of information resources available from a data warehouse environment [11]. Metadata enables decision makers to measure data quality [30]. Empirical evidence suggests that end-user metadata quality has a positive impact on end-user attitude about data warehouse data quality [11]. Metadata is important not only from the end-user standpoint, but also from the standpoint of data acquisition, transformation, load and the analysis of warehouse data [38]. It is essential in designing, building, and maintaining data warehouses. In a data warehouse there are two main kinds of metadata to be collected: business (or logical) metadata and technical (also known as ETL) metadata [38]. The ETL metadata is linked to the back-end processes that extract, transform, and load the data [30]. The ETL metadata is most often used by technical analysts for development and maintenance of the data warehouse [18]. In this article, we focus mainly on ETL metadata that is critical for ETL development, batch cycle refreshes, and maintenance of a data warehouse.

In data warehouses, data from external sources is first loaded into staging subject areas. Then, analytical subject area tables, built in such a way that they fulfill the needs of reports, are refreshed for use by report users. These tables are refreshed multiple times a day by pulling data from staging area tables. However, not all tables in data warehouses receive changed data during every cycle refresh: the more frequently the batch cycle runs, the lower the percentage of tables that change in any given cycle. Refreshing all tables without first checking for source data changes causes unnecessary loads and wastes database system resources. The development of a metadata model that enables utility stored procedures to identify source data changes means that load jobs can be run only when needed. By controlling batch job runs, the metadata model is also designed to minimize use of database systems resources. The model makes analytical subject area loads metadata-driven.

The model is also designed to provide the production support team with a user-friendly tool. This allows them to monitor the cycle refresh and look for issues relating to a job failure of a table and load discrepancies in the error and message log table. The metadata model provides the capability to set up the behavior of subsequent cycle runs after the one-time full refresh. This works towards enabling tables to be truly metadata-driven. The model also provides developers with a set of objects to perform ETL development work. This enables them to follow standards in ETL development across the enterprise data warehouse.

In the metadata model architecture, the load jobs are skipped when source data has not changed. Metadata provides information to decide whether to run full or delta load stored procedures. It also has the capability to force a full load if needed. The model also controls collect statistics jobs, running them after a certain interval or on an on-demand basis, which helps minimize resource utilization. The metadata model has several audit tables to archive critical metadata for three months to two years or more.

In Section 2 we discuss related work. In Section 3 we give a detailed description of an ETL metadata model and its essence. In Section 4, we cover metadata-driven batch processing, batch cycle flow, and an algorithm for wrapper stored procedures. The main contribution of this work is presented in Sections 3 and 4. In Section 5 we discuss use of metadata in data warehouse subject area refreshes. We conclude in Section 6 by summarizing the contribution made by this work, providing a review of the metadata model’s benefits and proposing future work.

2. Literature Research

Research in the field of data warehousing is focused on data warehouse design issues [13, 15, 7, 25], ETL tools [20, 32, 27, 19], data warehouse maintenance [24, 12], performance optimization or relational view materialization [37, 1, 23] and implementation issues [8, 29]. Limited research has been done on the metadata model aspects of data warehousing. Golfarelli et al. [14] provide a model for multidimensional data which is based on business aspects of OLAP data. Huynh et al. [16] propose the use of metadata to map between object-oriented and relational environments within the metadata layer of an object-oriented data warehouse. Eder et al. [9] propose the COMET model that registers all changes to the schema and structure of data warehouses. They consider the COMET model as the basis for OLAP tools and transformation operations with the goal of reducing incorrect OLAP results. Stohr et al. [33] have introduced a model which uses a uniform representation approach based on the Unified Modeling Language (UML) to integrate technical and semantic metadata and their interdependencies. Katic et al. [21] propose a model that covers the security-relevant aspects of existing OLAP/data warehouse solutions. They assert that this particular aspect of metadata has seen rather little interest from product [...]

Under our ETL metadata model, platform-independent, database-specific utility tools are used to load the tables from external sources to the staging areas of the data warehouse. The proposed metadata model also enables database-specific software, such as stored procedures, to perform transformation and load analytical subject areas of the data warehouse. The intent of the software is not to compete with existing ETL tools. Instead, we focus on utilizing the capabilities of current commercial database engines (given their enormous power to do complex transformation and their scalability) while using this metadata model. We first present the ETL metadata model followed by detailed descriptions of each table. We also provide experimental results (via Tables 2 to 6 in Section 5) based on our application of the metadata model in a real-world, production data warehouse.
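The load decision outlined in the Introduction (skip a load job when source data has not changed, otherwise run a full or delta load, with the option to force a full load) can be illustrated with a minimal Python sketch. This is not the stored-procedure implementation used in the article. The class `TableLoadMetadata`, its fields, and the function `choose_load_action` are hypothetical names introduced only for illustration, and treating a never-loaded table as a full load is our assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TableLoadMetadata:
    """Hypothetical metadata row for one target table (illustrative names only)."""
    table_name: str
    last_load_ts: Optional[datetime]   # None means the table has never been loaded
    source_last_change_ts: datetime    # latest change observed in the staging source
    force_full_load: bool = False      # manual override to force a full load


def choose_load_action(meta: TableLoadMetadata) -> str:
    """Return 'full', 'delta', or 'skip' for the next batch cycle."""
    # Assumption: a forced load or a first-ever load is a full refresh.
    if meta.force_full_load or meta.last_load_ts is None:
        return "full"
    # Source changed since the last load: only the changed rows need loading.
    if meta.source_last_change_ts > meta.last_load_ts:
        return "delta"
    # Nothing changed in the source, so the load job is skipped.
    return "skip"
```

In this sketch a batch cycle would call `choose_load_action` once per target table and dispatch to the corresponding full or delta load procedure, so that unchanged tables consume no load resources.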