Extract Transform Load (ETL) Process in Distributed Database Academic Data Warehouse
APTIKOM Journal on Computer Science and Information Technologies Vol. 4, No. 2, 2019, pp. 61~68
ISSN: 2528-2417, DOI: 10.11591/APTIKOM.J.CSIT.36

Extract transform load (ETL) process in distributed database academic data warehouse

Ardhian Agung Yulianto
Industrial Engineering Department, Engineering Faculty, Andalas University
Kampus Limau Manis, Kota Padang, West Sumatera, Indonesia, 25163
Corresponding author, e-mail: ardhian.ay@eng.unand.ac.id

Abstract
While a data warehouse is designed to support the decision-making function, the most time-consuming part is the Extract Transform Load (ETL) process. In the case of an academic data warehouse whose data sources are the faculties' distributed databases, integration is not straightforward even though every faculty runs a typical database. This paper presents an ETL process for a distributed database academic data warehouse. Following the data flow thread in the data staging area, a deep analysis is performed to identify all tables in each data source, including content profiling. The cleaning, conforming, and data delivery steps then pour the different data sources into the data warehouse (DW). Since the DW is developed using Kimball's bottom-up multidimensional approach, we found three types of extraction activities from the data source tables: merge, merge-union, and union. The cleaning and conforming steps are set by creating conformed dimensions based on data source analysis, refinement, and hierarchy structure. The final ETL step loads the data into integrated dimension and fact tables, with generation of a surrogate key. These processes run gradually over each distributed database data source until all are incorporated. The technical activities in this distributed database ETL process can generally be adopted in other industries, where the designer must have advance knowledge of the structure and content of the data sources.

Keywords: data warehouse; distributed database; ETL; multidimensional

Copyright © 2019 APTIKOM - All rights reserved.
1. Introduction
A simple definition of Extract, Transform, and Load (ETL) could be "the set of processes for getting data from OLTP (On-Line Transaction Processing) systems into a data warehouse." The ETL process is part of the data staging area, the first stage that receives data derived from heterogeneous data sources while assuring data quality and consistency [1]. In a data warehouse (DW), the ETL process is the most time-consuming part; it has even been reported that its share can rise to 80% of total project development time [2]. As an important component of a database-oriented decision support system, the DW stores records about activities and events related to a distinct business process. The DW itself is defined as "a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for decision making" [3]. The main components of a DW are operational data sources, the data staging area, the data presentation area, and data access tools [4]. Operational data sources support different business areas, different technologies, and different formats for staging data.

The case study in this research is the Academic Information System at Andalas University, Padang, Indonesia. Andalas University consists of 15 faculties, and each faculty has several departments, with a total of 120 departments. This information system provides all services in the academic process, from course registration, lecturer assignment for every course, and the grading system through to graduation. When initiating this information system, the database administrator designed it as a distributed database system. By the definition of a Distributed Database (DDB), a DDB deployed on only one computer remains a DDB, because it is still possible to deploy it across multiple computers [5].
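To make the ETL definition quoted above concrete, the following is a minimal sketch of the three stages (extract from an OLTP source, transform, load into a staging table). All table and column names here are invented for illustration; this is not the system's actual code.

```python
# Minimal ETL sketch: pull rows from a hypothetical OLTP source,
# clean them, and load them into a warehouse staging table.
import sqlite3

def extract(oltp_conn):
    # Read raw student records from the OLTP system (hypothetical table).
    return oltp_conn.execute(
        "SELECT student_id, name, dept FROM students").fetchall()

def transform(rows):
    # Basic cleaning: trim whitespace and drop rows missing the key.
    return [(sid, name.strip(), dept.strip())
            for sid, name, dept in rows if sid is not None]

def load(dw_conn, rows):
    # Deliver the cleaned rows into a staging table in the warehouse.
    dw_conn.executemany(
        "INSERT INTO stg_students (student_id, name, dept) VALUES (?, ?, ?)",
        rows)
    dw_conn.commit()
```

In a real deployment each stage would also be staged to disk between steps, which is exactly the data staging area role described in this section.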
In general terminology, a DDB is a collection of multiple, separate DBMSs, each running on a separate computer and utilizing some communication facility to coordinate their activities in providing shared access to the enterprise data [6]. The advantages of this configuration are that it reflects the organizational structure and improves local autonomy, availability, and performance; on the other side, there are disadvantages such as lack of standards, complexity, and integrity control [7].

Received May 15, 2019; Revised June 10, 2019; Accepted June 30, 2019

By this design, the database server is installed on a single machine and consists of several separate academic databases, one per faculty. Every faculty thus has its own typical database with identical structures (tables and attributes), but there is no guarantee that the reference tables hold the same data. This information system runs well as a transactional system, and to prepare for enhanced academic business intelligence, we propose the construction of an academic DW including a data integration process.

Figure 1. The main components of the proposed academic data warehouse. The ETL activities are represented in the data staging area.

The architecture in Figure 1 is essential for an academic data warehouse to have truthful and comprehensive integrated data. The first objective of this study is the implementation of a bottom-up dimensional data warehouse design approach. The system is initiated from the business requirements provided by the university executives, then built through the dimension and fact tables with measurement data as representatives of the multidimensional technique. The other objective of this research is to identify related attributes from each data source and classify them as master tables or operational tables. In addition, several activities are performed in the ETL process, such as extracting, cleaning, and conforming tables from the data sources and loading them into the DW.
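Because every faculty database shares the same table structures, one basic extraction pattern is to read the same table from each faculty's database and union the rows, tagging each row with its faculty of origin so the source remains traceable. The sketch below assumes a mapping of faculty identifiers to open database connections; the names are hypothetical, not taken from the actual system.

```python
import sqlite3  # used in the example below; any DB-API connection works

def union_faculty_table(faculty_conns, table):
    """Union the rows of an identically structured table across faculty
    databases, prefixing each row with the faculty id it came from."""
    rows = []
    for faculty_id, conn in faculty_conns.items():
        # NOTE: `table` is interpolated directly; only pass trusted names.
        for row in conn.execute(f"SELECT * FROM {table}"):
            rows.append((faculty_id,) + tuple(row))
    return rows
```

This corresponds to the simplest of the extraction activities the paper identifies (union); merge and merge-union additionally reconcile overlapping reference data between faculties.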
This process draws on data quality assessment and data analysis of the different faculties' academic database data sources.

2. Research Method
With distributed database data sources, the ETL process becomes more complicated because there is no single control mechanism over the contents of each database. A gradual and careful process to identify which tables should be merged, to profile each table's attributes, and to analyze the data in the tables is implemented in detail in this work. Knowing the data structure helps in cleaning and conforming dimensions and in building fact tables. In the case of an academic data warehouse, all steps are arranged based on the business process: declaring the grain, the dimensions, and the data measurements in fact tables are the steps for building the data warehouse.

In building the data warehouse, we adapted Kimball's bottom-up approach for developing a multidimensional model [8]. Based on Kimball's four-step dimensional design process, the model design activities are built by the following steps:
a. Select the business processes to be modeled, taking into account business requirements and available data sources.
b. Declare the grain: define an individual fact table representation related to the business process measurements.
c. Choose the dimensions: determine the sets of attributes describing fact table measurements. Typical dimensions are time and date.
d. Identify the numeric facts: define which measures to include in fact tables. Facts must be consistent with the declared grain.

The important advantage of this approach is that it consists of data marts representing business units, which have a very quick initial setup and are effortless to integrate with other data marts. For an enterprise-scale DW, this method is easy to extend from an existing DW and also provides reporting capability.
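When the dimension tables produced by this design are loaded, the delivery step assigns surrogate keys: each distinct natural key (for example, a department code) receives a warehouse-generated integer key that is reused on later loads. The following is a generic sketch of that assignment under assumed names; it is not the paper's implementation.

```python
def assign_surrogate_keys(natural_keys, key_map=None):
    """Assign integer surrogate keys to natural keys.

    key_map is the existing natural-key -> surrogate-key mapping from
    previous loads; new keys continue after the current maximum so that
    already-assigned keys never change.
    """
    key_map = dict(key_map or {})
    next_key = max(key_map.values(), default=0) + 1
    for nk in natural_keys:
        if nk not in key_map:
            key_map[nk] = next_key
            next_key += 1
    return key_map
```

Keeping surrogate keys stable across incremental loads is what allows fact rows loaded at different times to join consistently to the same dimension rows.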
After that, within the ETL process there are four prominent staging steps, in which data is staged (written to disk) in parallel with being transferred to the next stage, as described in Figure 2.

Figure 2. Four-step data flow thread in the data staging area as ETL activities [3]

Research on ETL performance has already been carried out by other researchers. Igor Mekterović et al. [9] proposed a robust ETL procedure with three agents: changer, cleaner, and loader; that work focuses on handling horizontal data segmentation and versioning when the same fact tables are used by various users and must not influence one another. Vishal Gour et al. [10] proposed another ETL performance improvement, a query cache technique to reduce ETL response time. E. Tute et al. [11] describe an approach for modeling ETL processes to serve as a kind of interface between regular IT management, the DW, and users, in the case of a clinical data warehouse. Abid Ahmad et al. [12] note the use of a distributed database in the development of a DW; their study explains more about the architecture and the change detection component. Sonali Vyas et al. [13] define various ETL processes and testing techniques to facilitate selection of the best ETL technique and tests. Examining the studies mentioned, the research gap this work addresses is clear: an advanced analysis technique for distributed database data sources. In addition, the specific case of the academic area is treated with a concrete strategy for each ETL step after analyzing the academic data.

3. Results and Discussion
3.1. Designing Multidimensional Model
Kimball's four-step dimensional design, followed for building the academic data warehouse, is shown in Table 1.

Table 1. Four-step data modeling process for the academic data warehouse

Step | Output
1. Identify the business process | Business Process: Analysis of New Student