Automatic Generation of Data Merging Program Codes

AUTOMATIC GENERATION OF DATA MERGING PROGRAM CODES Hyeonsook Kim, Samia Oussena, Ying Zhang Model Driven Research Centre, Thames Valley University, Saint Mary’s Road, London, U.K. Tony Clark Department of Computing, Middlesex University, London, U.K. Keywords: Data Merging Meta-model, Data Integration, Model Driven Engineering, Model Driven Data Integration, Automatic Model Transformation, Automatic Program Code Generation. Abstract: Data merging is an essential part of ETL (Extract-Transform-Load) processes to build a data warehouse system. To avoid rewheeling merging techniques, we propose a Data Merging Meta-model (DMM) and its transformation into executable program codes in the manner of model driven engineering. DMM allows defining relationships of different model entities and their merging types in conceptual level and our formalized transformation described using ATL (ATLAS Transformation Language) enables automatic generation of PL/SQL packages to execute data merging in commercial ETL tools. With this approach data warehouse engineers can be relieved from burden of repetitive complex script coding and pain of maintaining consistency of design and implementation. 1 INTRODUCTION Applying Model Driven Engineering (MDE) to ETL processes is one of the attractive approaches. A Data Warehouse (DW) is a collection of The approach reduces the complexity of ETL design integrated subject-oriented data bases designated to by decoupling data and meta-data and improves support the decision making process (Kimball et al, communication between domain experts and 2002). Building a DW involves processes that developers by using graphical model design. It also combine data with various formats and present a increases productivity due to the reduced amount of unified view of the data, as well as extract data from handcrafted coding and of rework for maintenance. different sources and cleanse inappropriate data. DW This is achieved by defining an abstracted model has become very popular choice for many enterprise first and transforming it into program codes. Thus, systems such as business intelligence and more of ETL working codes can be derived and maintained enterprise systems data needed to be added to the from well defined ETL models which are described data warehouse. To support the growing demands of in abstracted level and gradually mapped into DW development, ETL (Extract-Transform-Load) concrete level. processes have supported a systematic framework A number of these MDE approaches have been for the extraction of the data from heterogeneous proposed either as a UML extension or as their own data sources, and its transformation; cleansing, graphical notation for conceptual ETL data mapping converting, and loading them into the data design (Mora1 et al, 2004), (March et al, 2007). A warehouse. According to (March et al, 2007), ETL meta-model for process has also been proposed to processes are not only important for design and apply MDE to workflow and scheduling in DW maintenance of DW but also key factors for the (Bohm et al, 2008). Muñoz et al have proposed not success of DW projects. In this context, various just a design model but a whole conceptual data approaches have been proposed in order to improve integration framework (Muñoz et al, 2009). the ETL engineering. However most of these works address the whole ETL process and do not consider the problems 179 Kim H., Oussena S., zhang Y. and Clark T. (2010). AUTOMATIC GENERATION OF DATA MERGING PROGRAM CODES. In Proceedings of the 5th International Conference on Software and Data Technologies, pages 179-186 DOI: 10.5220/0003008301790186 Copyright c SciTePress ICSOFT 2010 - 5th International Conference on Software and Data Technologies which need to be addressed in each DW building 2000). Through each data process, data sources are phase. Furthermore, they have rarely demonstrated gradually reformatted and moved into target how to integrate industrial standards in their schemas. The processes can be easily executed and approaches. The details of previous research works maintained by controlling data from models within a are described in section 6. model driven approach. In this paper, we mainly focus on model driven In this section, we introduce general model data merging approach to address problems in data driven approach with two representative methods merging domain. Based on a real case study of a and discuss our own approach which is implemented DW development project, we propose a data utilising the general approach. merging system to generate executable merging codes from conceptual design. A Data Merging 2.1 General Model Driven Approach Meta-model (DMM) was proposed for design of merging models at conceptual level. Common Model Driven Engineering (MDE) is a software Warehouse Meta-model (CWM), an industrial engineering methodology uses models as primary standard for data warehouse modeling, was also artefacts that drive the whole development process used for design of merging models at physical level through model transformations. Over the years (CWM, 2008). The proposed system provides model based development has gained rapidly transformation of DMM into CWM. By using the increasing popularity across various engineering standard, it allows not to be bound to a particular disciplines. The representative two approaches are tool but allows the use of any DW development presented in this section. environment. Through this system, data warehouse engineers can develop a unified data schema by 2.1.1 Model Driven Architecture creating abstractions that help them program in terms of their design intent rather than the Model Driven Architecture (MDA) is the first underlying computing environment. The executable initiative of MDE which uses UML as modeling data merging codes can be obtained from CWM language and OCL (Object Constraint Language) merging models since ETL tool vendors provide and QVT (Query/View/Transformation) as model code generation from CWM. transformation language (OCL, 2008), (Kleppe et al, The rest of this paper is structured as follows. 2003). It is launched by the Object Management Section 2 presents model driven data warehousing Group (OMG) in 2001 and mainly focuses on providing both the general approach and ours. The forward engineering like producing code from proposed data integration framework and merging abstract and human-elaborated modeling diagrams meta-model were described in the section as well. separating design from architecture. Section 3 shows our implementation works MDA uses the Platform Independent Model illustrating the system architecture, target meta- (PIM) which represents a conceptual design to model; CWM and transformation rules. A case study realize the functional requirements. PIM is translated to which we applied the proposed model driven into one or more Platform Specific Models (PSMs) approach is introduced in Section 4. Finally related that computer can run. Accordingly, model works are given in section 5 and conclusions in transformations which support conversion between section 6. PIM and PSM are particularly important for the realization of MDA. Most software development IDEs support MDA 2 MODEL DRIVEN DATA by providing UML modeling and code generation from the UML models but there are many critics that INTEGRATION UML is too generic to describe domain specific problems. Another direction is to develop domain The whole data warehousing processes can be specific languages designed to solve common model divided into four phases; (1) analyzing and transformation tasks. Indeed, this approach has been understanding data in the different data sources, (2) widely taken recently by the research community preparing and collecting data into staging area; and software industry. As a result, a number of usually one physical platform, (3) combining data model transformation languages have been proposed through data cleansing, merging, and transformation, (Marcos et al, 2006), (Greenfield, 2004). which covers most ETL processes, (4) and finally, customizing data into different presentation according to application purposes (Rahm et al, 180 AUTOMATIC GENERATION OF DATA MERGING PROGRAM CODES 2.1.2 Eclipse Modeling Framework reverse transformation since existing data sources have their own schema or structure by which PSM is Eclipse is one of the most popular IDEs. It provides drafted. Extraction PIMs are usually designed on the convenient pluggable architecture. It also provides a basis of Data Source PIMs analysis and transformed meta-meta-model called ecore and its own modeling into PSMs and program codes in turn later. Merging framework for MDE (Dave et al, 2008). This PIMs are commonly designed after building data framework generates the model development cleansing strategy and then transformed into PSMs environment automatically. Developers can design and merging execution codes. Based on unified data their own models and transform them into target model, Customized PIMs are also built in order to models once a specific domain model is designed as present data in different way. a meta-model based on ecore. In addition to this, there are several open source 2.2.2 Data Merging plug-ins that facilitate model driven development based on Eclipse modeling framework with various In this paper, we concentrate on model driven data functionality. For example, ATLAS Model Weaver merging. Data merging in data warehousing includes (AMW) extends eclipse modeling framework for combining and moving data into target schema as model to model

Load more