
A COURSEWARE ON ETL PROCESS

Nithin Vijayendra B.E, Visveswaraiah Technological University, Karnataka, India, 2005

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

A COURSEWARE ON ETL PROCESS

A Project

by

Nithin Vijayendra

Approved by:

______, Committee Chair Dr. Meiliu Lu

______, Second Reader Dr. Chung-E Wang

______Date


Student: Nithin Vijayendra

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project.

______, Graduate Coordinator ______Dr. Nik Faroughi Date

Department of Computer Science


Abstract

of

A COURSEWARE ON ETL PROCESS

by

Nithin Vijayendra

Extract, Transform and Load (ETL) is a fundamental process used to populate a data warehouse. It involves extracting data from various sources, transforming the data according to business requirements and loading it into target data structures. Inside the transform phase a well-designed ETL system should also enforce data quality and data consistency and conform data so that data from various source systems can be integrated. Once the data is loaded into the target systems in a presentation-ready format, the end users can run queries against it to generate reports which help them make better business decisions. Even though the ETL process consumes roughly 70% of the resources, it is hardly visible to the end users [5].

The objective of this project is to create a website which contains a courseware on the ETL process and a web based ETL tool. The website, containing the ETL courseware, can be accessed by anyone with internet access. This will be helpful to a wide range of audiences, from beginners to experienced users. The website is created using HTML, PHP, Korn shell scripts and MySQL.

The ETL tool is web based and anyone with internet access can use it for free. However, guests have limited access to the tool while registered users have complete access. Using this tool, data can be extracted from text files and MySQL tables, combined together and loaded into MySQL tables. Before the data is loaded into the target MySQL tables, various transformations can be applied to it according to business requirements. This tool is developed using HTML, PHP, SQL and Korn shell scripts.

______, Committee Chair Dr. Meiliu Lu

______Date


ACKNOWLEDGMENTS

I am thankful to all the people who have helped and guided me through this journey of completing my Masters Project.

My sincere thanks to Dr. Meiliu Lu, for giving me an opportunity to work under her guidance on my master's project. She has been very supportive and encouraging and has guided me throughout the project. My heartfelt thanks to Prof. Chung-E Wang for being my second reader.

My special thanks to my friend Sreenivasan Natarajan for his patience in reviewing my project report. I would also like to thank all my friends who have been there for me throughout my graduate program at California State University, Sacramento.

Last but not least, I would like to thank my parents, sister and relatives for their unconditional love, support and motivation.


TABLE OF CONTENTS Page

Acknowledgements ...... vi

List of Tables ...... x

List of Figures ...... xi

Chapter

1. INTRODUCTION ...... 1

2. BACKGROUND ...... 4

2.1 Need for an ETL tool ...... 5

2.2 Scope of the project ...... 6

2.3 Technology related...... 7

3. ETL COURSEWARE ...... 11

3.1 ETL components ...... 14

3.2 Requirements ...... 14

3.2.1 Business requirements ...... 15

3.2.2 Data profiling ...... 15

3.2.3 Data integration requirement ...... 16

3.2.4 Data latency requirements...... 16

3.2.5 Data archiving requirements ...... 17

3.3 Data profiling ...... 17

3.4 Data extraction ...... 19

3.5 Data validation and integration ...... 23


3.6 Data cleansing ...... 24

3.7 Data transformations ...... 26

3.7.1 Surrogate key generator operation ...... 27

3.7.2 Lookup operation ...... 27

3.7.3 Merge operation ...... 28

3.7.4 Aggregation operation ...... 29

3.7.5 Change capture operation ...... 29

3.7.6 Change apply operation ...... 30

3.7.7 Data type operation ...... 31

3.8 Data load ...... 32

3.8.1 Historic load ...... 32

3.8.2 Incremental load ...... 33

3.8.3 Loading dimension tables ...... 34

3.8.3.1 Type 1 Slowly Changing Dimension ...... 35

3.8.3.2 Type 2 Slowly Changing Dimension ...... 36

3.8.4 Loading fact tables ...... 40

3.9 Exception handling ...... 43

4. ETL TOOL ARCHITECTURE ...... 45

5. ETL TOOL IMPLEMENTATION ...... 50

5.1 Using the tool ...... 50

5.2 Extraction phase ...... 52

5.2.1 Text file as source ...... 53


5.2.2 MySQL as source ...... 59

5.3 Transformation phase...... 61

5.3.1 Transformation for a single source ...... 61

5.3.2 Transformation for multiple sources ...... 67

5.4 Loading phase ...... 68

6. CONCLUSION ...... 71

6.1 Future enhancements ...... 72

Bibliography ...... 73


LIST OF TABLES

Page

Table 1 Before snapshot of Store Dimension Table for Type 1 SCD ...... 36

Table 2 After snapshot of Store Dimension Table for Type 1 SCD ...... 36

Table 3 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 1) ...... 38

Table 4 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 1) ...... 38

Table 5 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 1) ...... 38

Table 6 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 2) ...... 39

Table 7 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 2) ...... 39

Table 8 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 2) ...... 40

Table 9 Table structure to Store Usernames and Password ...... 51

Table 10 Type of Input Box Based on Data Type ...... 56

Table 11 Structure of INFORMATION_SCHEMA.COLUMNS Table ...... 63


LIST OF FIGURES Page

Figure 1 Overview of ETL Process ...... 2

Figure 2 Screenshot of ETL Courseware Page ...... 6

Figure 3 Screenshot of Transformations and Load Page of ETL Tool ...... 7

Figure 4 ETL Process...... 13

Figure 5 Components of ETL ...... 14

Figure 6 OLTP source for ETL Process ...... 20

Figure 7 Delimited File Format ...... 21

Figure 8 Fixed-width File Format ...... 21

Figure 9 Overview of Data Transformation ...... 26

Figure 10 Lookup Operation ...... 27

Figure 11 Merge Operation ...... 28

Figure 12 ETL Tool Layers ...... 45

Figure 13 Layers and Components of ETL Tool ...... 47

Figure 14 Add or Delete Users ...... 51

Figure 15 Source Selection ...... 52

Figure 16 Screenshot of Define Page ...... 55

Figure 17 Flow for Landing Data Using Text Source ...... 58

Figure 18 Flow of Landing Data Using MySQL Table as Source ...... 60

Figure 19 Screenshot of Details Webpage ...... 60

Figure 20 Flow of Transformation Phase ...... 61


Figure 21 Screenshot Showing Various Transformations ...... 65

Figure 22 Flow of Landing Phase ...... 68

Figure 23 Screenshot Showing Transformations for Multiple Sources ...... 69



Chapter 1

INTRODUCTION

According to [1], "A data warehouse is a historical, subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". By Historical we mean, the data is continuously collected from various sources and loaded in the warehouse. The previously loaded data is not deleted for a long period of time, so the warehouse contains historical data. By Subject Oriented we mean, data is grouped into specific business areas instead of the business as a whole. By Integrated we mean, collecting and merging data from various sources, and these sources could be disparate in nature. By Time-variant we mean, that all data in the data warehouse is identified with a particular time period. By Non-volatile we mean, once data is loaded in the warehouse it is never deleted or overwritten; hence it is not expected to change over time.

Extract, Transform, Load (ETL) is the back-end process which involves collecting data from various sources, preparing the data according to business requirements and loading it in the data warehouse. Extraction is the process where data is extracted from various source systems and temporarily stored in database tables or files. Source systems could range from one to many in number, and similar or completely disparate in nature.

Once the extracted data is staged temporarily, it should be checked for validity and consistency using data validation rules. Transformation is the process which involves the application of business rules to the source data before it is loaded into the data warehouse.

Figure 1 Overview of ETL process

As can be seen in Figure 1, there can be several data sources that have characteristics which differ from each other. These data sources could be in different geographic locations; could be incompatible with the organization's data store; could be many in number; could be on different platforms like mainframes, UNIX or Windows; and the availability of data from each source system may also vary. In the Extraction phase, data needs to be extracted from the various source systems and placed temporarily in databases or flat files called the landing zone [6].

In the Transformation phase, the landed data is picked up, cleansed and transformed based on the business requirements. There can be one or many transformation operations applied to the datasets, which could lead to a change in data value, a change of data type or a change of data structure by addition or deletion of data. The transformed data is loaded into database tables, and this area is called the staging area [6].

In the Loading phase, which is the last step in the ETL process, the validated, integrated, cleansed, transformed and ready-to-load data from the staging area is loaded into the warehouse dimension and fact tables.

This report is organized into several chapters. Chapter 1 gives a brief introduction to data warehouses and the role of the ETL process in data warehousing projects. Chapter 2 provides background and a detailed introduction to the ETL process; it discusses the need for an ETL tool, the scope of the project and the related technology used to build the ETL web tool. Chapter 3 discusses the ETL courseware. The ETL courseware contains material which a user new to ETL processes must know in order to implement successful ETL projects. There are several components of ETL, such as requirements, data profiling, extraction, validation and integration, which this chapter explains in detail. Chapter 4 gives an overview of the architecture of the ETL web tool created for this project. Chapter 5 discusses the implementation of the ETL web tool in detail, along with snippets of important source code. Chapter 6 summarizes and concludes this report with a glimpse into future enhancements to the courseware and the tool.


Chapter 2

BACKGROUND

The existence of the data warehouse dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". It was meant to provide an architectural model which focused on the flow of data from operational systems to decision support systems. The architecture consisted of an operational data layer, a data access layer, a metadata layer and an informational access layer. The operational data layer is the source for a data warehouse; the data access layer is the interface between the operational data layer and the informational access layer; the metadata layer holds the metadata describing the data in the warehouse; and the informational access layer is the last layer, which is used by business analysts to analyze and generate reports [8].

There are several approaches to populating a warehouse. The top down approach, by Bill Inmon [1], proposes populating the data warehouse first and then populating the data marts. The bottom up approach, by Ralph Kimball [3], proposes populating the data marts first and then populating the data warehouse. There is also the hybrid approach, which is a combination of the top down and bottom up approaches [5]. An ETL tool is used in the data access layer to extract data from the source systems and load it into the warehouse or the mart, irrespective of which approach is used.


2.1 Need for an ETL tool

Interested users who would like to learn more about ETL tools may not have access to one. This is because commercial ETL tools are expensive to buy for small or medium sized projects. They need expensive hardware to run on and need specialists to configure them before a normal user can start using them. Most of them are neither open source nor web based. They also have short evaluation periods of 30 to 60 days. The ETL web tool created in this project helps overcome the above challenges. It is web based, accessible freely to anyone with internet access and very user-friendly. Beginning ETL developers can use this tool to get a feel of what an ETL tool does before they dive into understanding complex commercial ETL tools.

An ETL tool has many advantages over hand-coded ETL code. It helps in simpler, faster and cheaper development of ETL code. Technical people with broad business skills who are not professional programmers can use ETL tools effectively.

Many ETL tools generate metadata automatically at every step of the process, thus enforcing consistent metadata throughout the process. They also have integrated metadata repositories which can be synchronized with source systems, target systems and other tools. They deliver good performance with very large datasets by using parallelism concepts such as pipelining and partitioning [4]. They come with built-in connectors for most source and target systems. Most ETL tools these days also have built-in schedulers to run the ETL code at scheduled times.


2.2 Scope of the project

This project mainly focuses on creation of a website which contains courseware on ETL and a web based ETL tool.

The ETL courseware is available to everyone with an internet connection and is intended for a wide variety of audiences, from beginners to experienced professionals. Initially it explains the basics of each phase - Extract, Transform and Load; later it delves into the details of each phase. Below is a screenshot of a webpage from the ETL courseware website.

Figure 2 Screenshot of ETL courseware page

The ETL tool is web based and accessible to anyone with internet access. It allows users to select data from two types of source systems. They can be Comma Separated Value (CSV) flat files, MySQL tables or a combination of both. Once the data is extracted, it can be merged together, data type based transformations can be applied, and the result loaded into target MySQL tables. Below is a screenshot of a webpage from the ETL web tool.

Figure 3 Screenshot of transformations and load page of ETL tool

2.3 Technology related

This section discusses the various technologies used in developing the ETL web tool.

The ETL tool is composed of three layers: the client layer, the processing layer and the database layer which are described below along with the technologies used:

 Client Layer: Users use a web browser, like Microsoft Internet Explorer, to

access/control the ETL tool. They can specify the number of sources, the type of

source, transformation to be applied on the extracted data and the target MySQL

table connection details. This layer is built using PHP and HTML.


 Processing Layer: The processing layer collects the user input from the client

layer. It processes user requests in the background and displays success or error

messages to the user. This layer has MySQL and text file connectors which

connect to the sources and extract data, based on the user's input. Once the data is

extracted, it is staged in temporary MySQL tables or flat files so that

transformations can be applied to it without disturbing the source content. This

layer is built using PHP, Korn shell scripts, MySQL statements and scripts.

 Database Layer: The database layer has the target MySQL connector which

connects to the target MySQL table and inserts the transformed data. This layer is

built using Korn shell scripts, MySQL statements and scripts.

MySQL 5.1

MySQL is an open source SQL database management system, developed, distributed, and supported by Oracle Corporation. It is used for mission-critical, heavy-load production systems and delivers a very fast, multi-user, multi-threaded and robust

SQL database server [10].

Key Features

1. High performance for a variety of workloads.

2. Connectors for C, ODBC, Java, PHP, Perl, .NET, etc.

3. Wide range of supported platforms.

4. XML functions with XPath support

5. Partitioning

6. Row-based replication

7. Great documentation, community and commercial support [11].

PHP 4.3

PHP: Hypertext Preprocessor is a general-purpose scripting language that is designed for web development to produce dynamic web pages. PHP code is embedded into the HTML source document and is interpreted by a web server installed with a PHP processor module, which generates the web page document.

Key Features

1. Persistent database connections

2. Good connection handling

3. Easy remote file handling

4. Better session handling

5. Good command line usage [12]

Korn Shell Scripts:

The Korn shell (ksh) is a Unix shell which was developed by David Korn. It is backwards-compatible with the Bourne shell and includes many features of the C shell

[13].

Key Features

1. Supports associative arrays and built-in floating point arithmetic.

2. Supports pipes

3. Supports pattern matching

4. Exception handling

5. Multidimensional arrays

6. Sub-shells

7. Unicode support [14]

In addition to the above features, PHP and MySQL are extensively used on the California State University, Sacramento campus.

In this chapter we discussed the background of ETL tools. We mentioned the need for an ETL tool and its several advantages over hand-coded ETL. The scope of this project was discussed along with the related technologies that were used to create a web-based ETL tool. Now that the reader is familiar with the overview and motivation for this project, we discuss the ETL courseware in Chapter 3.

Chapter 3

ETL COURSEWARE

The ETL courseware can be read by anyone who is interested in learning about ETL processes. This courseware is very useful to users who have no prior knowledge of ETL processes or ETL tools. Professional ETL developers could also use it as a reference. The ETL courseware is freely available to everyone who has internet access and can be accessed at http://gaia.ecs.csus.edu/~web_etl/etl/.

The courseware starts with the basics and then proceeds to advanced topics. Initially it introduces the ETL process to readers. Then it discusses the various ETL components. Each ETL component is explained in detail with an example and a figure for easier understanding. Important topics like requirements, data profiling, data extraction, data transformation and data loading are explained in depth.

After reading this courseware, readers should be able to define what the ETL process is and what its various components are. They'll have a thorough understanding of what each component does and they'll be able to apply these concepts in ETL project implementations.

Users can better understand the ETL courseware by using the ETL tool implemented for this project. The ETL tool is simple to use and user-friendly, and users can learn the basics of an ETL tool before trying their hands at complex, commercially available ETL tools.

Based on my prior work experience and by referring to the following books, I have compiled this courseware.

 W.H Inmon, "Building the Data Warehouse" Fourth Edition

 Jack E. Olson, "Data Quality: The Accuracy Dimension"

 Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The

Complete Guide to " Second Edition

 Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data

Warehouse Design: Relational and Dimensional Techniques"

 Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical

Techniques for Extracting, Cleaning, Conforming, and Delivering Data"

 Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The

Complete Project Lifecycle for Decision-Support Applications"

 Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly

Practical Tools for Data Warehousing and Business Intelligence"

The following sections give a brief overview of the ETL process and its components. For more details please refer to the website at http://gaia.ecs.csus.edu/~web_etl/etl/

Extract, Transform and Load (ETL) is a fundamental process used to populate a data warehouse. It involves extracting data from various sources, validating it for accuracy, cleaning it and making it consistent, transforming the data according to business requirements and loading it into the target data warehouse. Inside the Transform phase a well-designed ETL system should also enforce data quality and data consistency and conform data so that data from various source systems can be integrated. Once the data is loaded into the target systems in a presentation-ready format, the end users can run queries against it to make better business decisions. Even though the ETL process consumes roughly 70% of the resources, it is hardly visible to the end users [5].

Figure 4 ETL process


3.1 ETL components

Below are the various components of an ETL process.

Figure 5 Components of ETL

Figure 5 shows just one ETL flow, which loads one or more tables. Similarly, in the background there can be multiple ETL flows loading tables in the same data warehouse.

3.2 Requirements

Just like designing any system requires understanding the requirements first, the design of an ETL system should also start with requirements analysis. All the known requirements and constraints affecting the ETL system have to be gathered in one place.

Based on the requirements, architectural decisions should be made at the beginning of the ETL project. Construction of the ETL code should start only after the architectural decisions are baselined. A change in architecture at a later point of implementation would result in re-implementing the entire system from the very beginning, since such decisions affect hardware, software, personnel and coding practices. Listed below are the major requirements.

3.2.1 Business requirements

Business requirements are the requirements of the end users who use the data warehouse. Based on the populated information content the end users can make better informed decisions. Selection of source systems is directly dependent on the business needs. Interviewing the end users to gather business requirements not only sets an expectation as to what they can do with the data, but also gives the ETL team a chance to discover additional capabilities in the data sources that can expand the end users' decision making capabilities.

3.2.2 Data profiling

Data profiling is the process of examining the available data present in the source systems and collecting statistics about that data. The purpose of these statistics can be to

 Assess the risk involved in integrating data from various applications, including

the challenges of joins.

 Assess whether metadata accurately describes the actual values present in source

systems.

 Understanding data challenges early stages of the project would avoid delays and

cost.

Data profiling examines the quality, scope and context of the source data and enables the ETL team to build an effective ETL system. If the source data is very clean and well maintained before it arrives at the data warehouse, then minimal transformation is required to load it into the dimension and fact tables. If the source data is dirty, then most of the ETL team's effort would be spent in transforming, cleaning and conforming the data.

Sometimes the source data might be deeply flawed and not able to support the business objectives. In this case the data warehouse project should be cancelled.

Data profiling gives the ETL team a clear picture of how much data cleaning should be in place to achieve the end users' requirements. This would also result in better estimates and timely completion of the project.

3.2.3 Data integration requirement

Data from transaction systems must be integrated before it arrives in the data warehouse. In data warehousing this takes the form of conforming dimensions and conforming facts.

Conformed dimensions contain common attributes from different databases so that drill-across reports can be generated using these attributes. Conformed facts are common measures, like Key Performance Indicators (KPIs), from across different databases, so that these numbers can be compared mathematically.

In an ETL system, data integration is a separate step and it involves mandating common names of attributes and facts and common units of measurement.

3.2.4 Data Latency requirement

The data latency requirement from the end users specifies how quickly the data has to be delivered to them. This requirement has a huge effect on the architecture of the ETL system. A batch oriented ETL system can be sped up using efficient processing algorithms, parallel processing and faster hardware, but sometimes the end users require data on a real-time basis. This requires a conversion of the ETL system from batch-oriented to real-time oriented.

3.2.5 Data archiving requirement

Archiving data after it's loaded into the data warehouse is a safer approach when data needs to be reprocessed or for auditing purposes.

3.3 Data profiling

Data profiling is the process of examining the available data present in the source systems and collecting statistics about that data. The purpose of these statistics can be to

 Assess the risk involved in integrating data from various applications, including

the challenges of joins.

 Assess whether metadata accurately describes the actual values present in source

systems.

 Understanding data challenges early stages of the project would avoid delays and

cost.

According to Jack Olson [2], data profiling employs analytic methods of looking at data for the purpose of developing a thorough understanding of the content, structure and quality of the data. A good data profiling system can process very large amounts of data, and with the skills of the analyst, uncover all sorts of issues that need to be addressed.

Data profiling examines the quality, scope and context of the source data and enables the ETL team to build an effective ETL system. If the source data is very clean and well maintained before it arrives at the data warehouse, then minimal transformation is required to load it into the dimension and fact tables. If the source data is dirty, then most of the ETL team's effort would be spent in transforming, cleaning and conforming the data.

Sometimes the source data might be deeply flawed and not able to support the business objectives. In this case the data warehouse project should be cancelled. Data profiling can be achieved using commercial tools or hand coded applications. The data profiling process reads the source data and generates a comprehensive report on:

 Data types of each field

 Natural keys

 Relationships between tables

 Data statistics like maximum values, minimum values, most occurred values,

number of occurrences of each values etc...

 Dates in non-date fields

 Data anomalies like junk values, values outside given range, missing values etc...

 Null values

Data profiling gives the ETL team a clear picture of how much data cleaning should be in place to achieve the end users' requirements. This would also result in better estimates and timely completion of the project.
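As an illustration, part of such a profiling report can be produced with plain SQL. The minimal sketch below assumes a hypothetical source table customer_src with a cust_name column; all table and column names are examples only.

-- Basic column profile: row count, NULL count, distinct values and value lengths.
SELECT COUNT(*)                    AS total_rows,
       SUM(cust_name IS NULL)      AS null_count,
       COUNT(DISTINCT cust_name)   AS distinct_values,
       MIN(CHAR_LENGTH(cust_name)) AS min_length,
       MAX(CHAR_LENGTH(cust_name)) AS max_length
FROM   customer_src;

-- Value frequencies: helps spot junk or default values.
SELECT cust_name, COUNT(*) AS occurrences
FROM   customer_src
GROUP BY cust_name
ORDER BY occurrences DESC
LIMIT 20;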


3.4 Data extraction

This chapter focuses on the extraction of data from the source systems. Data extraction is the process of selecting, transporting and consolidating the data from the source systems to the ETL environment.

As organizations grow they would like to add more data sources to their existing data store. Each source system has characteristics which differ from the others. These data sources could be in different geographic locations; could be incompatible with the organization's data store; could be many in number; could be on different platforms like mainframes, UNIX or Windows; the periodicity (daily, weekly, monthly, etc.) of feeding the data to the warehouse could vary; and the availability of data from each source system may also vary. These vast differences in source system characteristics make data integration challenging.

Data extracted from the source systems is placed temporarily in databases or flat files, and this area is called the landing zone. Described below are a few sources from which organizations commonly extract data.

OLTP systems:

OLTP stands for Online Transaction Processing. These are a class of systems which facilitate and manage transaction-oriented applications which require fast data insertions and retrievals. These systems store their daily transactional data in relational databases. These databases are normalized for faster insertion and retrieval queries.

To extract data from these systems, ODBC drivers or native database drivers are used. The disadvantage of ODBC drivers over native database drivers is that they require more steps, as shown in the diagram below, and take more time. It's a better approach to always extract only the required data by using an appropriate WHERE clause in the SELECT query.

Figure 6 OLTP source for ETL process
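As a minimal sketch of such an extraction query, assume a hypothetical orders table with a last_updated timestamp column and a one-row-per-flow etl_control table that records the previous extraction time; only the rows changed since the last run are pulled.

-- Extract only the rows modified since the previous extraction run.
SELECT o.order_id,
       o.customer_id,
       o.order_amount,
       o.last_updated
FROM   orders o
WHERE  o.last_updated > (SELECT last_extract_ts
                         FROM   etl_control
                         WHERE  flow_name = 'orders_extract');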

Flat files

A flat file is an operating system file which contains text or binary content, one record per line. In ETL projects we usually come across two formats of flat files. Both are described below.

 Delimited file format

In this file format, data is stored in separate lines as shown below. Each line is separated by a new line character and each field is separated by a delimiter. Field delimiters can be a comma, pipe or other characters but have to remain the same for all fields. Also each record could have a record delimiter. In this example the record delimiter is a semi-colon. There could also be a quote delimiter for each field. It contains a single quote or a double quote. The final delimiter, in this case an End-Of-File character, denotes the end of that flat file.

Figure 7 Delimited file format

 Fixed-width file format

In this file format, data is stored in separate lines similar to the delimited file format; however each field occupies a fixed width and each line is of the same width. Field values which do not occupy the entire field length are padded with spaces. Each field can be identified by its start and end positions. In the example below, field 1 starts at position 1 and ends at position 4, field 2 starts at position 5 and ends at 11, and so on.

Figure 8 Fixed-width file format
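Either flat file format can be landed into a staging table. In MySQL, a delimited file is commonly loaded with the LOAD DATA INFILE statement; the sketch below assumes a hypothetical comma-delimited file /tmp/store_feed.csv and a landing table land_store, and the path, delimiters and column list must be adjusted to the actual feed.

-- Landing table for the delimited store feed.
CREATE TABLE IF NOT EXISTS land_store (
    store_id      INT,
    store_city    VARCHAR(50),
    store_state   VARCHAR(50),
    store_country VARCHAR(50)
);

-- Land the comma-delimited file; skip the header record if one is present.
LOAD DATA INFILE '/tmp/store_feed.csv'
INTO TABLE land_store
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;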


Web log sources

Most internet companies have a log called a web log which stores visitor information. These web logs record the information posted or retrieved for that particular website by each user. There are several uses for this. One is to analyze users' click patterns on the website and find out which webpage gets the most hits from which geographic location. Based on this information the companies can further analyze and improve their website to increase user traffic. In order to analyze web logs, they have to be extracted from various regions, transformed and stored in data warehouses.

ERP systems:

ERP stands for Enterprise Resource Planning and was designed to provide integrated enterprise solutions by integrating important business functions like inventory, human resources, sales, financials etc. Since ERP systems are massive and extremely complex it would take years to collect data in them according to business requirements.

This makes them a valuable source for ETL systems. Nowadays most ETL tools provide ERP connectors to fetch data from ERP systems.

FTP

FTP stands for File Transfer Protocol and is a standard protocol used over TCP/IP networks to transfer files between machines. When ETL tools don't have appropriate connectors/adapters to connect to the source system, FTP pull/push is used to fetch data into ETL environment.

FTP pull takes place at the ETL end and is used when we are sure the source file will be available at a pre-determined time. Here the FTP is initiated by the ETL tool at a scheduled time.

FTP push takes place at the source and is used when availability of the source file is unknown. Here FTP is initiated by the source system when the file is created.

3.5 Data validation and integration

This chapter focuses on integration and validation of extracted data. Data extracted from the source systems must be validated and integrated before proceeding to the next phase of ETL process.

Data validation: Data that has been extracted from different source systems, and landed in the landing zone, must be validated before integrating it. This phase makes sure the source data has been completely transferred to the ETL environment. Also, it makes sure the latest data in the source systems has been extracted. It's important that the extraction process only extracts the latest business data; otherwise reloading old data would lead to duplicates in the data warehouse.

There are several ways to check whether the complete source data has been extracted:

 If flat files are extracted, make sure the record count in source and target match.

 For flat files, make sure the source and target checksum match.

 For data extracted in to tables, make sure record count in source and target match.

Data integration: Data that has been extracted from different source systems, and landed in the landing zone, must be integrated after it has been validated. Also, data that is similar should be integrated together. Care must be taken to start the integration process only after all the required data sets are present in the landing zone, since source systems can be located in different geographic locations and would have to be extracted in different time zones.
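The record-count check listed above can be sketched as a single query; the source table orders and its landed copy land_orders are hypothetical names. A non-zero difference indicates an incomplete extraction.

-- Compare source and landing-zone record counts.
SELECT src.cnt           AS source_count,
       tgt.cnt           AS landed_count,
       src.cnt - tgt.cnt AS difference
FROM  (SELECT COUNT(*) AS cnt FROM orders)      AS src,
      (SELECT COUNT(*) AS cnt FROM land_orders) AS tgt;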

3.6 Data cleansing

This chapter focuses on cleansing the validated data. Data cleansing is a process that makes sure the quality of data in a warehouse is maintained. It is also defined as a process which helps to maintain complete, consistent, correct, unambiguous and accurate data [2]. These attributes define the quality of data and are explained below.

Complete: Complete data is defined for each instance, without any NULL values in it, and records are not lost in the information flow. For example, in the Social Security Administration office, an individual's record with no SSN would be incomplete.

Consistent: The definition, value and format of the data must be same throughout the warehouse. For example, California State University Sacramento is known by several names: Sac state, CSUS, Cal Univ. Sacramento etc. To make this consistent only one convention should be followed everywhere.

Correct: The value should be true and meaningful. For example, age cannot be negative. Another example: if a pallet contains 4 items then the same count should be reflected in the warehouse.

Unambiguous: The data can have only one meaning. For example, there are several cities in the U.S by the name New Hope but there is only one city by that name in

Pennsylvania. This unambiguous data should be loaded in the warehouse for clarity.

Accurate: Accurate means that the data loaded in the warehouse should be complete, consistent, correct, unambiguous and must be derived or calculated with precision.

There are several data-quality checks which can be enforced.

1) Column check: This check ensures that incoming data contains the expected values from the source system's perspective. Some of the column property checks are: checking for NULL values in non-nullable columns, checking fields for unexpected lengths, checking for numeric values which fall outside a range, checking fields which contain values other than what is expected, etc.

2) Structure check: Column checks focus on individual columns but structure checks focus on the relationships between those columns. Structure checks can be enforced by having proper primary keys and foreign keys so that they obey referential integrity. Structure checks also enforce parent-child relationships.

3) Data and value check: Data and value checks can be simple or complex. An example of a simple check: if a customer flies in business class then he'll get double the number of points in his account compared to a customer who flies economy class. An example of a complex check: a customer cannot be in a limited partnership with firm A and a member of the board of directors of firm B.

3.7 Data transformations

This chapter focuses on transforming the cleansed data to load it into the warehouse. These transformations are applied based on the business requirements, data warehouse loading approach (top down or bottom up) and source-to-target mapping document.

A transformation operation takes input data, modifies it by applying one or more functions and returns the output data. This could lead to a change in data value, change of data type or change of data structure by addition or deletion of data from it. When multiple functions are applied, the intermediate data are called data sets.

After data undergoes transformation according to the requirements, it's a better approach to save it temporarily in a database before finally loading it into the warehouse. This temporary area of storage is called the staging zone. In the last step, which is the data load phase, if the load process fails while loading data from the staging area into the warehouse, then only the load process needs to be restarted, avoiding going through the cleansing and transformation processes again.

Figure 9 Overview of data transformation

Described below are a few transformation operations. For a comprehensive list of transformation operations please refer to the online courseware.

3.7.1 Surrogate key generator operation:

A surrogate key is a number that uniquely identifies a record in a database table and is different from a primary key. A surrogate key is not derived from the application data but the primary key is.

A surrogate key generator operation takes one input, adds a new column which contains a surrogate key for each record and outputs the result dataset. For each input record, the surrogate key is calculated based on 4 parameters - initial value, current value, increment value and final value. If it's the first record in the dataset then the surrogate key generator assigns the initial value to it. If it's not the first record then the surrogate key generator adds the increment value to the current value and stores it in the record. Usually the current value is stored in a flat file or in a database table.
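A minimal sketch of the key-table approach in MySQL follows; the table names etl_key_source and store_dim and the increment value of 1 are assumptions for illustration, and a production implementation would also need to serialize concurrent access.

-- Key table that stores the current surrogate key value per target table.
CREATE TABLE IF NOT EXISTS etl_key_source (
    table_name    VARCHAR(64) PRIMARY KEY,
    current_value BIGINT NOT NULL
);

-- Seed the generator for the store dimension (initial value).
INSERT IGNORE INTO etl_key_source (table_name, current_value)
VALUES ('store_dim', 0);

-- Reserve the next key: add the increment value to the current value.
UPDATE etl_key_source
SET    current_value = current_value + 1
WHERE  table_name = 'store_dim';

-- Read back the key to attach to the new dimension record.
SELECT current_value FROM etl_key_source WHERE table_name = 'store_dim';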

3.7.2 Lookup operation

The lookup operation has an input dataset, a reference dataset and an output/final dataset, as shown in Figure 10.

Figure 10 Lookup operation

The lookup operation fetches the fields specified by the user, looks for a match in the reference dataset and returns the joined records if they match; otherwise it returns NULL values in place of the reference columns.
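In SQL terms this behaves like a left outer join, as in the minimal sketch below; stg_sales (input) and store_dim (reference) are hypothetical tables.

-- Attach reference columns to each input record; unmatched rows get NULLs.
SELECT s.sale_id,
       s.store_id,
       s.sale_amount,
       d.surr_key,
       d.store_city
FROM   stg_sales s
LEFT JOIN store_dim d ON d.store_id = s.store_id;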

3.7.3 Merge operations

A merge operation can have one or more input datasets but only one output dataset. It combines the data from the input datasets into the output. The criteria for using a merge operation are that the number of fields in all input datasets should be the same and the data types of corresponding fields should match. The result dataset will have the same number of fields and the same data types as its input datasets. The number of records in the output dataset is the total of all input datasets together.

Figure 11 Merge operation
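In SQL a merge of this kind corresponds to a UNION ALL, sketched below for two hypothetical, identically structured staging tables.

-- Combine two input datasets with identical structure into one output dataset.
SELECT sale_id, store_id, sale_amount FROM stg_sales_east
UNION ALL
SELECT sale_id, store_id, sale_amount FROM stg_sales_west;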


3.7.4 Aggregation operation

The aggregation operation takes a single input and produces a single output. It classifies records from the input dataset into groups and computes totals, minimums, maximums or other aggregate functions for each group, writing them to the output dataset. The fields to group by, and the fields used for the aggregate function calculation, must be specified by the user.
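Expressed in SQL, the aggregation operation is a GROUP BY over the user-specified fields. The sketch below groups a hypothetical stg_sales dataset by store_id and computes a few common aggregates.

-- Group input records by store_id and compute aggregates for each group.
SELECT store_id,
       COUNT(*)         AS record_count,
       SUM(sale_amount) AS total_sales,
       MIN(sale_amount) AS minimum_sale,
       MAX(sale_amount) AS maximum_sale,
       AVG(sale_amount) AS mean_sale
FROM   stg_sales
GROUP BY store_id;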

Below are a few examples of aggregate functions which commercial ETL tools provide nowadays:

 Maximum value

 Minimum value

 Mean

 Percentage coefficient of variation

 Sum of weights

 Sum

 Missing values count

 Range

 Variance

3.7.5 Change capture operation

The change capture operation takes two datasets as input, makes a record of the differences and produces one output dataset. The input datasets are denoted as the before and after datasets. The output dataset contains records which represent the changes made to the before dataset to obtain the after dataset. The comparison is based on a set of key fields; records from the two datasets are assumed to be copies of one another if they have the same values in these key columns. The output dataset has an extra column which denotes insert, delete, copy or edit. These terms are explained in the Change apply operation section.
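A change-capture pass can be sketched with two queries over hypothetical before_store and after_store datasets keyed on store_id; deletes would be found symmetrically by reversing the join.

-- Records present only in the after dataset: change code 'insert'.
SELECT a.*, 'insert' AS change_code
FROM   after_store a
LEFT JOIN before_store b ON b.store_id = a.store_id
WHERE  b.store_id IS NULL;

-- Records present in both datasets but with different values: change code 'edit'.
SELECT a.*, 'edit' AS change_code
FROM   after_store a
JOIN   before_store b ON b.store_id = a.store_id
WHERE  NOT (a.store_city <=> b.store_city
        AND a.store_state <=> b.store_state
        AND a.store_country <=> b.store_country);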

3.7.6 Change apply operation

The change apply operation can be used only after a change capture operation. It takes the change dataset, which is the resultant dataset from the change capture operation, and applies the encoded change operations to the before dataset to compute an after dataset. The encoded change operations are described below:

Insert: The change record is copied to the output;

Delete: The value columns of the before and change records are compared. If the value columns are the same or if the Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output.

Edit: The change record is copied to the output; the before record is discarded.

Copy: The change record is discarded. The before record is copied to the output.


3.7.7 Data type operation

Data type operations change the data type, precision or format of the input dataset.

Data type conversions: Conversions from text to date, text to timestamp, text to number, date to timestamp and decimal to integer are a few examples.

Precision conversions: Changing the numeric precision, say from the decimal value 3.12345 to 3.12, is an example of this conversion.

Format conversions: Changing date and timestamp formats is one of the examples for this conversion. If input is 28/01/90, based on business requirement, this could be changed to 1990-Jan-28.
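These conversions map directly onto SQL functions; the sketch below uses MySQL's STR_TO_DATE, CAST, ROUND and DATE_FORMAT on literal example values.

SELECT STR_TO_DATE('28/01/90', '%d/%m/%y')                          AS text_to_date,       -- data type conversion
       CAST('3.12345' AS DECIMAL(10,5))                             AS text_to_decimal,    -- data type conversion
       CAST(3.12345 AS SIGNED)                                      AS decimal_to_integer, -- data type conversion
       ROUND(3.12345, 2)                                            AS two_decimal_places, -- precision conversion
       DATE_FORMAT(STR_TO_DATE('28/01/90', '%d/%m/%y'), '%Y-%b-%d') AS reformatted_date;   -- format conversion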

Compare operations: A compare operation takes two inputs and produces a single output. It performs a column-by-column comparison of the records. This can be applied to numeric and alpha-numeric fields. The output dataset contains three columns. The first column is the result column, which contains a code giving the result of the compare operation. The second column contains the columns of the first input link and the third column contains the columns of the second input link. The result column usually has numeric codes which denote whether both inputs are equal, the first is empty, the second is empty, the first is greater or the first is lesser.

3.8 Data load

This section focuses on the last step of the ETL process, which is loading the validated, integrated, cleansed and transformed data into the data warehouse.

As mentioned in the previous section, it's a better approach to stage the ready-to-load data in temporary tables in a database. If the load process fails while loading data from the staging area into the warehouse, then only the load process needs to be restarted, avoiding going through the cleansing and transformation processes again.

Basically there are two types of data loads, namely the historic load and the incremental load.

3.8.1 Historic load

A data warehouse contains historical data. Based on user requirements, data in the warehouse has to be retained for a particular duration of time. This duration could be anywhere from a single year to several decades.

When a data warehouse is created, i.e., the tables in it are created, it contains no records, since planning for the creation of a warehouse could take several months or years. During this time there would be a lot of data in the OLTP systems which act as source systems for the warehouse. Loading this initial historical data into the warehouse is the initial historic load.

Sometimes, when data is loaded regularly into the warehouse, the ETL process might break and fixing it could take several hours to several days. During this fix time data will accumulate in the OLTP systems. Loading this data is also a historic load.

3.8.2 Incremental load

Incremental load is the periodic load of data into the warehouse. This process loads the most recent data from the OLTP systems and runs periodically till the end of the warehouse's life. Incremental loads could run daily, weekly, fortnightly, monthly, quarterly, yearly or at a scheduled time. For every incremental load there is a load window within which the ETL load process should start and finish loading into the target warehouse. After the end of the load window, business users will usually start querying and analyzing the data in the warehouse.

There are several operations involved in loading the warehouse. Based on the type of table being loaded, fact or dimension, the appropriate operation is selected. A few of the data load operations are described below:

Insert operation: This operation inserts data into the warehouse. If the data already exists in the table, this operation will fail. Hence the target table should be checked ahead of time before executing this operation.

Update operation: This operation updates the existing records in the warehouse. Unlike the insert operation, this doesn't fail if the records to be updated are not found.

Upsert operation: This operation first executes an update operation and, if that fails, inserts the records into the warehouse.

Insert update operation: This operation first executes an insert operation and, if that fails, updates the existing records in the warehouse. The insert update operation is preferred to the upsert operation since it is more efficient [7].

Delete insert operation: This operation first executes a delete operation and then inserts the source records into the warehouse.

Bulk load operation: Bulk load is a utility provided by major ETL vendors these days which is faster and more efficient in loading huge amounts (hundreds of millions of records) of data into the warehouse [4][7].

3.8.3 Loading dimension tables

A table that stores business related attributes and provides the context for facts is called a dimension table. Dimension tables are in denormalized form.

A dimension table contains a surrogate key, which is a meaningless incrementing integer value. Surrogate key values are generated and inserted by the ETL process along with the other dimension attributes. The surrogate key is made the primary key of the dimension table and is used to join with records in the fact table. By definition a surrogate key is supposed to be meaningless. However, it can be made meaningful by creating the surrogate key value by intelligently combining data from other attributes in the dimension table. But this would lead to more ETL processing, maintenance and updates if the actual attributes, on which these keys are based, change.

In addition to the surrogate key, a dimension table also contains a natural key. Unlike the surrogate key, the natural key is derived from meaningful application data. The dimension table also consists of other attributes.

A slowly changing dimension table is a dimension table which has updates coming in for the existing records. According to user requirements, older values in dimension tables can be historically maintained or discarded. Based on this there are three types of slowly changing dimensions (SCD). Loading data into a dimension table differs based on the type of dimension table. Below is a detailed explanation of each type and how to load it.

3.8.3.1 Type 1 Slowly Changing Dimension

For an existing record in the dimension table, if there is an update on any or all attributes from the source systems, the SCD Type 1 approach is to overwrite the existing record without saving the old values. This approach is used when an already inserted record in the dimension table is incorrect and needs to be corrected, or when business users don't see any use in keeping a history of previous values.

If the record doesn't exist, then a new surrogate key value is generated, appended to the dimension attributes and the record is inserted into the table.

If the record does exist, then the surrogate key of the existing dimension record is fetched and appended to the new source record, the old record is deleted and then the new record is inserted into the dimension table.

Upsert, insert update or delete insert operations can be used in this scenario. However, care has to be taken to save the existing surrogate key value when using the delete insert operation.

If there are a large number of Type 1 changes, then the best way to implement them is to prepare the new dimension records in a new table, then drop the existing records in the dimension table and use a bulk load operation.

Surr_Key Store_id Store_city Store_state Store_country

384729478 37287 Sacramento California United States

Table 1 Before snapshot of Store dimension table for Type 1 SCD

Surr_Key Store_id Store_city Store_state Store_country

384729478 37287 Los Angeles California United States

Table 2 After snapshot of Store dimension table for Type 1 SCD

Note: In the new snapshot the store_city field is updated without changing the surrogate key value.
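Assuming store_id is declared as a unique natural key on the Store dimension, the Type 1 overwrite above can be sketched in MySQL with INSERT ... ON DUPLICATE KEY UPDATE, which behaves like the upsert operation of Section 3.8.2 and leaves the existing surrogate key untouched.

-- Type 1 SCD: overwrite changed attributes, keep the existing surrogate key.
-- Assumes store_dim(surr_key PRIMARY KEY, store_id UNIQUE, ...).
INSERT INTO store_dim (surr_key, store_id, store_city, store_state, store_country)
VALUES (384729478, 37287, 'Los Angeles', 'California', 'United States')
ON DUPLICATE KEY UPDATE
    store_city    = VALUES(store_city),
    store_state   = VALUES(store_state),
    store_country = VALUES(store_country);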

3.8.3.2 Type 2 Slowly Changing Dimension

In this approach, if there are any changes to dimension attributes for existing records in the dimension table, the old values are preserved.

When an existing record needs to be changed, instead of overwriting it, a new record with a new surrogate key is generated and inserted into the dimension table. This new surrogate key is used in the fact table from that moment onwards. There is no need to change or update existing records in the fact or dimension table.

Type 2 SCD requires a good change capture system in place to detect changes in the source systems and then notify the ETL system. Sometimes update notifications won't be propagated from the source to the ETL system. In this case the ETL code is supposed to download the complete dimension and make a field by field, record by record comparison to detect updates. If the dimension table has millions of records and over 100 fields, then this would be a very time consuming process and the ETL code can't complete within the specified load window. In order to overcome this problem, CRC codes are associated with each record. The entire record is given as input to a CRC function which calculates a long integer code. This integer code will change even if there is a change of a single character in the input record. When CRC codes are associated with each record, only these codes are compared instead of making a field by field, record by record comparison.
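A minimal sketch of this comparison in MySQL uses the built-in CRC32 function over the concatenated attribute values; stg_store is a hypothetical staging table holding the newly extracted dimension records.

-- Find changed records by comparing CRC codes instead of every field.
SELECT s.store_id
FROM   stg_store s
JOIN   store_dim d ON d.store_id = s.store_id AND d.current_flag = 'Y'
WHERE  CRC32(CONCAT_WS('|', s.store_city, s.store_state, s.store_country))
    <> CRC32(CONCAT_WS('|', d.store_city, d.store_state, d.store_country));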

To implement this in ETL code, there are two flows. One flow has the new record and the other flow has the old record from the dimension table. For the new record, a new surrogate key is generated along with the current flag or start and end date values; this is used in an insert operation. For the old record, the current flag or start and end date values are updated and used in an update operation.

There are two ways to implement SCD Type 2:

1) Method 1: In the first method, a new flag column is added to the dimension table which indicates whether the record is current or not. In the example below, when a new record with surrogate key 384729479 is added, its current flag is inserted as "Y" and the current flag field value of the record with surrogate key 384729478 is set to "N". The same applies to snapshot 3.

Snapshot 1 of a Store dimension table:

Surr_Key Store_id Store_city Store_state Store_country Current_flag

384729478 37287 Sacramento California United States Y

Table 3 Snapshot 1 of Store dimension table for Type 2 SCD (Method 1)

Snapshot 2:

Surr_Key Store_id Store_city Store_state Store_country Current_flag

384729478 37287 Sacramento California United States N

384729479 37287 Los Angeles California United States Y

Table 4 Snapshot 2 of Store dimension table for Type 2 SCD (Method 1)

Snapshot 3:

Surr_Key Store_id Store_city Store_state Store_country Current_flag

384729478 37287 Sacramento California United States N

384729479 37287 Los Angeles California United States N

384729521 37287 Arlington Texas United States Y

Table 5 Snapshot 3 of Store dimension table for Type 2 SCD (Method 1)

2) Method 2: In the second method, two columns are added to the dimension table, namely start date and end date. Start date holds the date when the record was inserted and end date holds either the high date (which is 12/31/9999) or a Current_date - 1 value. As you can see from the example below, in snapshot 2, when a new record with a new surrogate key is inserted, its end date has the high date and its start date has the current date. The previous record's end date is updated with Current_date - 1. The same applies to snapshot 3.

To find the latest record, a query is run against the dimension table where end_date = '12/31/9999'.

Snapshot 1 of a Store dimension table (using Method 2):

Surr_Key Store_id Store_city Store_state Store_country Start date End date

384729478 37287 Sacramento California United States 10/1/2010 12/31/9999

Table 6 Snapshot 1 of Store dimension table for Type 2 SCD (Method 2)

Snapshot 2:

Surr_Key Store_id Store_city Store_state Store_country Start date End date

384729478 37287 Sacramento California United States 10/1/2010 10/20/2010

384729479 37287 Los Angeles California United States 10/21/2010 12/31/9999

Table 7 Snapshot 2 of Store dimension table for Type 2 SCD (Method 2)


Snapshot 3:

Surr_Key Store_id Store_city Store_state Store_country Start date End date

384729478 37287 Sacramento California United States 10/1/2010 10/20/2010

384729479 37287 Los Angeles California United States 10/21/2010 11/01/2010

384729521 37287 Arlington Texas United States 11/02/2010 12/31/9999

Table 8 Snapshot 3 of Store dimension table for Type 2 SCD (Method 2)
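Using Method 2, the two flows translate into an UPDATE that expires the current version and an INSERT that adds the new one. The sketch below follows the Store dimension example; the new surrogate key is assumed to come from the key generator, and the high date is written in MySQL's date format.

-- Expire the current version: end date becomes the day before the change.
UPDATE store_dim
SET    end_date = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
WHERE  store_id = 37287
  AND  end_date = '9999-12-31';

-- Insert the new version with a fresh surrogate key and the high end date.
INSERT INTO store_dim (surr_key, store_id, store_city, store_state,
                       store_country, start_date, end_date)
VALUES (384729521, 37287, 'Arlington', 'Texas', 'United States',
        CURDATE(), '9999-12-31');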

Please refer to the online courseware for loading a Type 3 SCD table.

3.8.4 Loading fact tables

Fact tables contain measurements or metrics of business processes of an organization. According to Ralph Kimball [5], "measurement is an amount determined by observation with an instrument or a scale".

Fact tables are defined by their grain. A grain represents the most atomic level by which facts are defined. For example, in a sales fact table, the grain could be an individual line item of a sales receipt.

Fact tables contain one or more measurements along with a set of foreign keys which point to dimension tables. Dimension tables are built around fact tables to provide context to the measurements present in the fact tables. Just a fact table, without any dimension tables surrounding it, makes no business sense.

Each fact table has a primary key which is a field or a group of chosen fields. The primary key of a fact table should be defined carefully so that duplicates don't occur during the load process. Duplicates can occur when insufficient attention is paid during the fact table's design or when unexpected values start flowing in from the source systems. When this happens there is no way to tell those records apart. To avoid this, it's a good approach to include a sequence number with each fact record insert.

To insert source records into the fact table, each natural dimension key should be replaced by the latest surrogate key. The surrogate key corresponding to a natural key value can always be found in the respective dimension table; surrogate keys are looked up using the lookup operation defined in the Transformations section.
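A minimal sketch of this lookup during the fact load is an INSERT ... SELECT that joins the staged records to the current dimension rows; stg_sales, store_dim and sales_fact are hypothetical names, and the Method 2 high date marks the current dimension record.

-- Replace the natural store key with the current surrogate key while loading.
INSERT INTO sales_fact (store_surr_key, sale_date, sale_amount)
SELECT d.surr_key, s.sale_date, s.sale_amount
FROM   stg_sales s
JOIN   store_dim d
       ON  d.store_id = s.store_id
       AND d.end_date = '9999-12-31';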

Below are a few points to keep in mind when loading a fact table, which can improve load performance:

1) Insert and update records should be loaded separately. This can be done by writing the temporary update and insert data into different datasets in the staging area and creating two separate ETL flows to load these records into the fact table. Many vendor ETL tools provide upsert and/or insert-update options; in this scenario, the insert-update option works efficiently.

2) Avoid SQL insert statements and use a bulk load utility, if available, to insert a huge number of records efficiently, thus improving the performance of the ETL code.

3) Load data in parallel if there are no dependencies between ETL flows. For example, if two tables being loaded do not have a parent-child relationship then the ETL code to load both could be started simultaneously. Many vendor ETL tools provide partitioning and pipelining mechanisms to load data in parallel. In the partitioning mechanism, a huge dataset is partitioned and several processes are created to work on the partitions in parallel. In the pipelining mechanism, after a process finishes processing a part of a huge dataset, it passes the processed chunk to the next stage in the ETL code. These two mechanisms speed up ETL processing significantly.

4) Disable rollback logging for databases which house data warehouse tables. Rollback logging is best suited for OLTP applications which require recovery from uncommitted transaction failures but for OLAP applications this consumes extra memory and CPU cycles since all data is entered and managed by ETL process [4].

5) Temporarily disabling indexes on fact tables while loading data and re-enabling them when the load is complete is a great performance enhancer. Another option is to drop unnecessary indexes and rebuild only the required ones.

6) Partitioning very big fact tables improves users' query performance. A table and its index can be physically divided and stored on separate disks. By doing so, a query that requires a month of data from a table having millions of records can fetch data directly from that particular physical disk without scanning the rest of the data [4].

A few of the above steps can be thought through ahead of time and implemented early, saving rework on the ETL code at a later stage in the project.
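As a rough MySQL sketch of points 2, 5 and 6, assuming a hypothetical sales_fact table and a staging file /tmp/sales_fact.csv (neither belongs to the tool described later in this report):

-- Points 2 and 5: disable index maintenance, bulk load, then rebuild indexes.
-- (DISABLE KEYS affects non-unique indexes on MyISAM tables.)
ALTER TABLE sales_fact DISABLE KEYS;

LOAD DATA INFILE '/tmp/sales_fact.csv'
INTO TABLE sales_fact
FIELDS TERMINATED BY ',';

ALTER TABLE sales_fact ENABLE KEYS;

-- Point 6: range-partition a large fact table by month so that a query for one
-- month scans only the relevant partition (available from MySQL 5.1; sale_date
-- must be part of every unique key on the table).
ALTER TABLE sales_fact
PARTITION BY RANGE (TO_DAYS(sale_date)) (
    PARTITION p201011 VALUES LESS THAN (TO_DAYS('2010-12-01')),
    PARTITION p201012 VALUES LESS THAN (TO_DAYS('2011-01-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);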


3.9 Exception handling

This chapter discusses exception handling in the ETL process. An exception in ETL is defined as any abnormal termination, unacceptable event or incorrect data which stops the ETL flow and thus prevents data from reaching the data warehouse. Exception handling is the process of handling these exceptions without corrupting any existing committed data and terminating the ETL process gracefully. During the ETL process, exceptions occur either due to incorrect data or due to infrastructure issues.

Data-related exceptions can be caused by incorrect data formats, incorrect values or incomplete data from the source systems. These records need to be captured as rejects, either in a file or in a database table, and should be corrected, reprocessed and inserted in the next ETL run.
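A minimal sketch of such a reject table is shown below, with hypothetical names; the reason for rejection is stored alongside the raw record so that it can be corrected and reprocessed.

-- Hypothetical reject table for data-related exceptions.
CREATE TABLE etl_rejects (
    reject_id     INT AUTO_INCREMENT PRIMARY KEY,
    source_name   VARCHAR(100),      -- which source system the record came from
    raw_record    TEXT,              -- the record exactly as received
    reject_reason VARCHAR(255),      -- e.g. incorrect date format
    reject_time   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);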

Infrastructure exceptions can be caused by hardware, network, database, operating system or other software issues. In such scenarios, when ETL jobs fail, care must be taken to make them restartable. Making the jobs restartable means that when they are restarted they neither insert duplicate data nor abort because of existing data.

Exceptions should be handled in the extraction, validation and integration, cleansing, transformation and data load phases in order to have stable and efficient ETL code in place.

This chapter discussed the courseware and the components of the ETL process that are important for every ETL project implementation. Anyone who wishes to participate in the implementation of an ETL project should know these components and should put appropriate exception handling mechanisms in place for a successful implementation. The next chapter discusses the architecture of the ETL web tool created for this project, along with its various components and how they interact with each other.


Chapter 4

ETL TOOL ARCHITECTURE

This chapter describes the architecture of the tool implemented in this project. First, the various layers, how they interact with each other and how they integrate to form a system are discussed; then the components of the tool are discussed in detail.

The ETL tool is composed of three layers: the client layer, the processing layer and the database layer.

Figure 12 ETL tool layers (client layer, processing layer and database layer)

The client layer is the one visible to the end user and is used to interact with the system. The processing layer is where the user input is processed; it has source connectors for text files and the MySQL database. The database layer has the target connector for MySQL, which it uses to connect to the target MySQL database and insert records.


Users use a web browser, such as Microsoft Internet Explorer, to control the ETL tool. They can specify the number of sources, the type of each source, the transformations to be applied to the extracted data and the target MySQL table connection details.

The processing layer collects the user input from the client layer. It processes user requests in the background and displays success or error messages to the user. This layer has MySQL and text file connectors which connect to the source, based on the user input, to extract data. Once the data is extracted, it is staged in temporary MySQL tables or flat files so that transformations can be applied to it without disturbing the source content.

The database layer has the target MySQL connector, which connects to the target MySQL table and inserts the transformed data.


Figure 13 shows the various components of the tool and how they are connected with the layers.

Figure 13 Layers and components of ETL tool

Users must use a web browser to access the tool and the courseware. There are two types of users: guest and registered. A guest user is one who is not a student, faculty or staff member of California State University, Sacramento. Guest users don't need a username or password to access the tool or courseware, but they have limited access to the tool. They can only select sample source text files or sample database tables and can write only to sample target database tables; the options to enter database and file details are blocked for them. On the other hand, a registered user is a student, faculty or staff member of California State University, Sacramento. Registered users require a username and password to log in and get unrestricted access to the tool. They can specify custom absolute file paths and custom MySQL tables as sources. These tables can be in any database located within the California State University, Sacramento campus and should be accessible via the campus LAN. They can also specify a custom target MySQL table to load their source data.

When users, guest or registered, open the homepage, they have the option of going either to the courseware or to the tool. When they click on the tool link, the first page they see is the login page. Those who don't have a username and password can click on the guest link to get access.

Once inside the tool, they come across the extract, transform and load pages in that order. On the extract page, they can select either a text file or a MySQL table as the source and enter details such as the absolute file path or the database details. When they click on continue, the input details are passed on to the processing layer as shown in Figure 13.

The processing layer reads the input details and uses the source connectors to validate the input data. If the user specifies text as the source, the text source connector checks that the file exists and is readable before proceeding to the next step. If the user specifies a MySQL table, the MySQL source connector connects to the source database and makes sure it has read privileges on the table before proceeding further. Once validation is successful, the processing layer copies the source data to the landing zone. The landing zone is a temporary work area where the source data is landed so that it can be processed by the tool. The reason for having a landing zone is that the source data remains untouched while the tool has complete access to manage and modify the files in the landing zone.

The next step is to define metadata for the selected source. If the selected source is text, the processing layer displays a webpage to define metadata for the individual columns. If the source is a MySQL table, the processing layer automatically fetches the table's metadata from the database dictionary.

Once the source metadata is defined, the user chooses the transformations to be applied to the extracted source data. Applying the transformations is also taken care of by the processing layer with the help of PHP and Korn shell scripts. The transformed data is stored in temporary files or MySQL tables, and this zone is called the staging zone.

Data from the staging zone is picked up by the database layer and loaded into the target MySQL tables using the target connectors.

This chapter gave an overview of the architecture, layers and components of the tool and discussed the system design. Familiarity with the design and architecture helps the reader better understand the detailed implementation of the tool in Chapter 5.


Chapter 5

ETL TOOL IMPLEMENTATION

This chapter describes the implementation of the ETL web tool. It discusses the guest and registered user login process; based on the type of user, several features may or may not be available. All options available in each phase are explained, along with screenshots and examples for each option.

5.1 Using the tool

Users must use a web browser to use the ETL tool. There are two types of users: guest and registered. A guest is anyone who is not a student, faculty or staff member of California State University, Sacramento. Guests don't need a username or password to use the tool; they simply click on the hyperlink "Guest? Click here". With guest credentials they have limited access to source files, source tables and target tables. Four sample files, along with their absolute paths, and four sample tables are listed in the tool online. Guests cannot enter source or target table details, as those inputs are disabled in the tool.

A registered user is a student, faculty or staff member of California State University, Sacramento who has been supplied a username and password by the professor. Registered users have unrestricted access, unlike guest users. They can enter the absolute path of the source files they want to load, and they can enter credentials such as server name, database name, table name, username and password for different MySQL source and target databases, which can be anywhere on campus and reachable via the college LAN.

New users can be added, or existing users deleted, by the professor by logging in as administrator. Usernames and passwords are stored in the Users MySQL table in the web_etl database, which can be accessed only by the administrator. The structure of the Users table is given below.

Username Varchar(50) Primary Key

Password Varchar(50) Not null

Table 9 Table structure to store usernames and password

As shown in Table 9, the username is of varchar data type with length 50 and is the primary key of the table. The password field is of varchar data type with length 50 and has a NOT NULL constraint. Each username entered must be unique in this table.
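The MySQL statement corresponding to Table 9 would look roughly like the sketch below; the exact DDL used by the tool is not listed in this report.

-- Sketch of the Users table described in Table 9, created in the web_etl database.
CREATE TABLE Users (
    Username VARCHAR(50) PRIMARY KEY,   -- primary key, so each username is unique
    Password VARCHAR(50) NOT NULL
);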

Figure 14 shows the flow the administrator has to follow in order to add or delete users from the system.

Figure 14 Add or delete users (the administrator logs in to the ETL tool and can then add or delete users)


The administrator opens the homepage, clicks on the ETL tool link, then enters administrator username and password. Once inside the system, the administrator can add new users or delete existing users.

5.2 Extraction phase

Users must use a web browser to use the ETL tool. The first page displays the extraction phase of the tool. Here the user can select the number of sources and the type of each source. The minimum number of sources is one and the maximum is two. The type of source can be a flat file and/or a MySQL table. Below is the flow showing how users can select a text file or a MySQL table as a source.

Figure 15 Source selection (a user or guest logs in to the ETL tool and selects a text source and/or a MySQL table source)

Users open the homepage, choose the ETL tool, log in with their username and password or choose the guest login, and then select the number of sources and their types. Both options are explained in detail below.


5.2.1 Text file as source

A guest user has limited options and can only choose from the list of sample files displayed on the webpage, whereas a registered user can specify a custom absolute path name.

If the user chooses a text file as one of the sources, then he/she has to specify the absolute path of the file present on Linux. The following points should be followed:

1. The file should exist and should have read permissions.

2. The file should be in ASCII format.

3. It should be comma delimited without a final delimiter.

4. It should not have any quotes or double quotes to separate the fields.

5. It can have a maximum of ten fields.

When the user enters the absolute path of the source file and clicks Next, the following checks are performed by the validateSourceFile.ksh script:

1. Check whether the user has entered the file path in the input box. If yes, proceed to step 2; else raise an error.

2. Check whether the entered path is a directory. If it is, raise an error; else proceed to step 3.

3. Check whether the file has read permissions. If yes, proceed to step 4; else raise an error.

4. Check whether the file size is zero. If it is, the file is empty, so raise an error; else proceed to the defineMetaData.php page to define the metadata of the source file.


Below is the code snippet of validateSourceFile.ksh

## Check for number of arguments to this program
## 1st argument is the filename
if [[ $# -ne 1 ]]; then
    echo "Error: Name of the source file not entered"
    exit 1
fi

## Assign 1st argument to var sourceFile
sourceFile=$1

## Check if var sourceFile is empty
if [[ -z $sourceFile ]]; then
    echo "Error: Filename supplied is empty"
    exit 1
fi

## Check if var sourceFile is a directory
if [[ -d $sourceFile ]]; then
    echo "Error: Source file path supplied is a directory. Please specify a file"
    exit 1
fi

## Check for file permissions
if [[ ! ( -r $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file doesn't have read permissions"
    exit 1
fi

## Check for empty file
if [[ ! ( -s $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file is empty"
    exit 1
fi

Once the source file is validated by validateSourceFile.ksh, the next step is to define the metadata on the defineMetaData.php page. The ETL tool allows up to ten fields to be defined for a source text file. The metadata consists of the field name, its data type and its length. The defineMetaData.php page displays a snapshot of 10 lines of the source file so that the user can refer to it while defining the metadata. Below is a screenshot of the web page which allows users to define metadata for the source file.

Figure 16 Screenshot of Define Metadata page

The first field is the name field, an input box where users type in the field name. The source fields have to be named on this page, and the source file should not contain field names as its first record.

The second field is the data type field. It is a drop down box and only one option can be selected. It has the following values:

• Varchar

• Char

• Integer

• Date

• Timestamp

• Decimal

The third field is an input box where the length of the field has to be specified by the user. This should contain only numeric values and should be greater than zero. This field is dynamic and can display length, precision or nothing based on the data type chosen by the user. Below is the table showing the data type chosen and the display that appears on the webpage.

Data type selected    Display
Varchar               Length input box
Char                  Length input box
Integer               Length input box
Date                  No display
Timestamp             No display
Decimal               Precision and Scale input boxes

Table 10 Type of input box based on data type

A snippet of HTML and JavaScript dynamically changes the length field based on the data type selected: for each column it renders the Name and Length input boxes, and Precision and Scale input boxes when the decimal data type is chosen.

Once the metadata is defined, temporary MySQL tables are created in the web_etl database to hold the source file data. Figure 17 shows the source flow.

Figure 17 Flow for landing data using a text source (text source → define metadata → create temp table → load source data)

Internally, the PHP source code prepares a custom CREATE TABLE SQL statement from the manually defined metadata and runs it against the web_etl database. This creates a temporary table matching the source data. The source data is then loaded into the temporary table using the load scripts.
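For illustration, if a user defined three fields (name, price and sale_date), the generated statements might look roughly like the sketch below; the temporary table name and the landing-zone file path are hypothetical.

-- Hypothetical example of the statements generated for a comma-delimited
-- source file with three user-defined fields.
CREATE TABLE web_etl.tmp_source_1 (
    name      VARCHAR(50),
    price     DECIMAL(8,2),
    sale_date DATE
);

LOAD DATA LOCAL INFILE '/tmp/landing/source1.txt'
INTO TABLE web_etl.tmp_source_1
FIELDS TERMINATED BY ','      -- source files are comma delimited
LINES TERMINATED BY '\n';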


5.2.2 MySQL table as source

A guest user has limited options and can only choose from the list of sample tables displayed on the webpage, whereas a registered user can specify the details described in the following sections.

If the user chooses a MySQL table as one of the sources, the following details must be specified:

1. Name of the remote server. The server should be within the college network and accessible.

2. Name of the database in the remote server.

3. Name of the table.

4. Username

5. Password

The script source_mysql.php accepts the above input and validates the following before proceeding further:

1. Checks the connection to the remote server

2. Validates the username and password

3. Checks if the database exists

4. Checks if the table exists.

If the above checks are satisfied, the source table structure and data are captured by createTempTableMySql.ksh. The user doesn't have to define the metadata manually.

Below is how the MySQL table source flow works:


Figure 18 Flow of landing data using a MySQL table as source (MySQL table source → extract metadata → create temp table → load source data)

After the MySQL source table details are entered, the PHP and Korn shell scripts connect to that database and extract the metadata automatically from the source table. They also export the source data into temporary files in the landing zone, then create a temporary table in the web_etl database and load the landing zone data into it.
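These scripts are not reproduced in this report; conceptually, though, the steps correspond to SQL along the lines of the sketch below, with hypothetical table names and export path.

-- Against the remote source database: capture the table definition and
-- export the data to a landing-zone file.
SHOW CREATE TABLE sales;
SELECT * FROM sales
INTO OUTFILE '/tmp/landing/sales.txt'
FIELDS TERMINATED BY ',';

-- Against the local web_etl database: create a matching temporary table
-- (from the definition captured above) and load the exported data.
CREATE TABLE web_etl.tmp_sales (
    sale_id  INT,
    store_id INT,
    amount   DECIMAL(8,2)
);

LOAD DATA INFILE '/tmp/landing/sales.txt'
INTO TABLE web_etl.tmp_sales
FIELDS TERMINATED BY ',';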

Below is a screenshot of the web page which allows users to enter credentials for the source MySQL table.

Figure 19 Screenshot of database details webpage


5.3 Transformation phase

After selecting the sources, users must select transformations based on their business requirements. Several transformations are available and are described below in detail. The flow is shown in Figure 20.

Figure 20 Flow of transformation phase (landing zone data from a single source or multiple sources is transformed and written to the staging zone)

The transformations offered depend on the number of sources selected during the extract phase. If multiple sources are selected, the two options available are merge with duplicates and merge without duplicates. If a single source is selected, the transformations are based on the data type of the fields in the input dataset. After the data is transformed, it is written into the staging area, ready to be loaded into target MySQL tables.

5.3.1 Transformation for a single source

When users choose a single source, they have several transformations available based on the data type of each field. The data type of each field is fetched from the database dictionary using the query SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = '$tablename';

The table below shows the structure of the COLUMNS table present in the INFORMATION_SCHEMA database [9]. Of all the fields present in this table, only the COLUMN_NAME and DATA_TYPE fields are required to display the transformations available to the user.

Field Type Null

TABLE_CATALOG varchar(512) YES

TABLE_SCHEMA varchar(64) NO

TABLE_NAME varchar(64) NO

COLUMN_NAME varchar(64) NO

ORDINAL_POSITION bigint(21) unsigned NO

COLUMN_DEFAULT longtext YES

IS_NULLABLE varchar(3) NO

DATA_TYPE varchar(64) NO

CHARACTER_MAXIMUM_LENGTH bigint(21) unsigned YES

CHARACTER_OCTET_LENGTH bigint(21) unsigned YES

NUMERIC_PRECISION bigint(21) unsigned YES

NUMERIC_SCALE bigint(21) unsigned YES

CHARACTER_SET_NAME varchar(32) YES

COLLATION_NAME varchar(32) YES

COLUMN_TYPE longtext NO

COLUMN_KEY varchar(3) NO

EXTRA varchar(27) NO

PRIVILEGES varchar(80) NO


COLUMN_COMMENT varchar(255) NO

Table 11 Structure of INFORMATION_SCHEMA.COLUMNS table

The different data types and the transformations available are described below.

• Integer data type

No transformations are available for this data type.

• Varchar and Char data types

Convert to lower case: This option converts the input string or characters to lower case. Example: Input "ABC"; Output "abc"

Convert to upper case: This option converts the input string or characters to upper case. Example: Input "Abc"; Output "ABC"

Remove leading spaces: This option removes leading spaces from the input string or characters, if any. Example: Input " abc "; Output "abc "

Remove trailing spaces: This option removes trailing spaces from the input string or characters, if any. Example: Input " abc "; Output " abc"

Remove leading and trailing spaces: This option removes leading and trailing spaces from the input string or characters, if any. Example: Input " abc "; Output "abc"

• Decimal data type

Round: This option rounds the input decimal to the nearest whole value. Example: Input 2.20; Output 2.00

Ceiling: This option returns the smallest integer value that is not less than the input decimal. Example: Input 2.20; Output 3

Floor: This option returns the largest integer value that is not greater than the input decimal. Example: Input 2.20; Output 2

Absolute value: This option converts the input decimal to its absolute value. Example: Input -2.20; Output 2.20

• Date and Timestamp data types

Get date: This option extracts the date part from a date or timestamp input. Example: Input "2010-01-04 14:09:02"; Output "2010-01-04"

Get day: This option extracts the day of the month from a date or timestamp input. Example: Input "2010-01-04 14:09:02"; Output 4

Get day of the week: This option returns the weekday index from a date or timestamp input, where 1 is Sunday, 2 is Monday, ..., 7 is Saturday. Example: Input "2010-10-18"; Output 2

Get month: This option returns the month from a date or timestamp input. Example: Input "2010-10-18"; Output 10

Get name of the month: This option returns the month name from a date or timestamp input. Example: Input "2010-10-18"; Output "October"

Get quarter: This option returns the quarter, between 1 and 4, from a date or timestamp input. Example: Input "2010-02-18"; Output 1

Get year: This option returns the year from a date or timestamp input. Example: Input "2010-02-18"; Output 2010
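These options map to standard MySQL functions. The following sketch shows, against a hypothetical temporary table and hypothetical column names, how a few of them could be written in SQL; the tool itself assembles such expressions in its PHP code.

-- Hypothetical illustration of the single-source transformations as MySQL functions.
SELECT LOWER(name),                  -- convert to lower case
       TRIM(LEADING ' ' FROM city),  -- remove leading spaces
       ROUND(price),                 -- round
       CEIL(price),                  -- ceiling
       FLOOR(price),                 -- floor
       ABS(balance),                 -- absolute value
       DATE(order_ts),               -- get date
       DAY(order_ts),                -- get day of the month
       DAYOFWEEK(order_ts),          -- get day of the week (1 = Sunday)
       MONTHNAME(order_ts),          -- get name of the month
       QUARTER(order_ts),            -- get quarter
       YEAR(order_ts)                -- get year
FROM   tmp_source_1;                 -- hypothetical landed/temporary table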

Below is a screenshot of the web page showing the transformations available when a single source is selected.

Figure 21 Screenshot showing various transformations

Below is the code from one_source.php which dynamically displays the transformations available to the user, based on the data type of the input fields.

// Build an HTML table with one row per source column: the column name, its
// data type, and a drop-down of the transformations valid for that data type.
// (The HTML markup inside the echo statements is abbreviated in this listing.)
echo "";                                 // opening table markup and header row:
                                         // Column name | Datatype | Transformation
$i = 1;
$j = 0;
while ($row = mysql_fetch_array($result)) {
    echo "";                             // start of the table row
    echo "" . $row[0] . "";              // column name cell
    echo "" . $row[1] . "";              // data type cell
    if ($row[1] == 'decimal') {
        $j = $i + 1;
        echo "";                         // drop-down of decimal transformations
        echo "";                         // additional markup for this column
    } else if ($row[1] == 'varchar' || $row[1] == 'char') {
        $j = $i + 1;
        echo "";                         // drop-down of varchar/char transformations
        echo "";                         // additional markup for this column
    } else if ($row[1] == 'timestamp' || $row[1] == 'date') {
        $j = $i + 1;
        echo "";                         // drop-down of date/timestamp transformations
        echo "";                         // additional markup for this column
    } else {
        $j = $i + 1;
        echo "";                         // no transformations (integer data type)
    }
    echo "";                             // end of the table row
    $i = $i + 2;
}
echo "";                                 // closing table markup

5.3.2 Transformation for multiple sources

When users choose two sources, they have two options for transforming the input data: merging the two with duplicates or merging them without duplicates. Users should note that when they choose multiple sources to merge, both sources must have the same number of fields and the corresponding fields must have the same data types.

Once the transformation is selected, PHP and Korn shell scripts check whether the two datasets are compatible by comparing the number of columns and the data types of corresponding columns. If they match, the merged data is temporarily landed in the staging zone, which is a temporary MySQL table. If they don't match, an error is displayed to the user.
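In SQL terms, the two merge options behave like UNION ALL and UNION. The sketch below illustrates the semantics with hypothetical temporary table names; the tool's scripts may implement the merge differently.

-- Merge with duplicates: keep every row from both landed sources.
INSERT INTO web_etl.tmp_staging
SELECT * FROM web_etl.tmp_source_1
UNION ALL
SELECT * FROM web_etl.tmp_source_2;

-- Merge without duplicates: UNION removes duplicate rows.
INSERT INTO web_etl.tmp_staging
SELECT * FROM web_etl.tmp_source_1
UNION
SELECT * FROM web_etl.tmp_source_2;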


5.4 Loading phase

After the user chooses the transformations to be applied to the source data, he/she has to choose a MySQL table into which to load the data. The transformed data is loaded directly from the staging zone to the target MySQL table, as shown in Figure 22.

Figure 22 Flow of loading phase (staging zone → target MySQL table)

If the target table is in the same server and database as the staging zone table, the data is copied directly to the target table. If the target table is located in a different database, the data from the staging zone is exported to temporary files and then loaded into the target table using the export scripts.
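A minimal sketch of the two cases, using hypothetical table names and a hypothetical export path:

-- Case 1: target table is in the same server and database as the staging table.
INSERT INTO target_sales
SELECT * FROM tmp_staging;

-- Case 2: target table is in a different database or server; export the staging
-- data to a temporary file, then load it into the target table.
SELECT * FROM tmp_staging
INTO OUTFILE '/tmp/staging/export.txt'
FIELDS TERMINATED BY ',';

LOAD DATA INFILE '/tmp/staging/export.txt'
INTO TABLE target_sales
FIELDS TERMINATED BY ',';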

The target MySQL table must already be defined and exist in the target database. The user has to specify the following details:

• Name of the remote server. The server should be within the college network and accessible.

• Name of the database in the remote server.

• Name of the table.

• Username

• Password

The script transform_source.php accepts the above input and validates the following before proceeding further:

1. Checks the connection to the remote server


2. Validates the username and password

3. Checks if the database exists

4. Checks if the table exists.

5. Checks if the user has insert permissions.

If the above checks are satisfied, transform_source.php fetches the data from the staging zone and inserts the transformed data into the target table.

Below is a screenshot of the web page showing the transformation and load page when multiple sources are selected.

Figure 23 Screenshot showing transformations for multiple sources


Below is the code snippet from transform_source.php which generates the custom SQL to insert the data into the target table.

$ROWS = $_POST['ROWS'];          // number of columns posted from the transformation page
$ROWS = $ROWS * 2;               // each column posts a pair of values
$i = 1;
while ($i <= $ROWS) {
    $a = $_POST[$i];             // first value of the pair (used as the function argument)
    $j = $i + 1;
    $b = $_POST[$j];             // second value of the pair (used as the function name)

    if ($i == ($ROWS - 1)) {
        $c = $c . $b . "(" . $a . ")";       // last pair: no trailing comma
    } else {
        $c = $c . $b . "(" . $a . "),";
    }
    $i = $i + 2;
}

$sql = "insert into $table select $c from table";   // custom INSERT ... SELECT statement
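For example, if a user chose the upper-case transformation for a name column and the round transformation for an amount column, the generated statement might look like the line below; the table and column names are hypothetical.

insert into target_sales select UPPER(store_name), ROUND(amount) from tmp_staging;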

This chapter discussed the implementation of the ETL web tool. For each phase it described the available options in detail, along with screenshots, instructions on how to use them, an example for each option and important source code snippets to help the user understand the internal working of the tool. Chapter 6 is the final chapter, which gives an overall summary along with possible future enhancements to the courseware and tool.


Chapter 6

CONCLUSION

In this project, the ETL courseware and the ETL tool implementation were discussed. The ETL courseware covers important aspects, from initial requirements gathering to final error handling, which ETL developers need to know in order to implement ETL projects successfully. The ETL tool implemented in this project can extract data from multiple heterogeneous sources, combine them, apply various transformations and load the transformed data into target database tables. The ETL tool is implemented using PHP 4.3, Korn shell scripts and MySQL 5.1.4.

In conclusion, this project has accomplished its primary goals and objectives as discussed in the scope section 2.2 of Chapter 2. The main objective of the ETL courseware is first to introduce basic concepts to familiarize the interested audience with ETL and then to discuss advanced topics like cleansing, transformations, and dimension and fact table loads. The main objective of the ETL tool is to provide the interested audience free access to learn what an ETL tool is and how it works. The tool provides an interactive, user-friendly graphical user interface. It can extract data from heterogeneous sources, land it in the landing zone, apply various types of transformations, stage the data in the staging zone and finally load the transformed data into database tables. The heterogeneous sources can be flat files or MySQL tables, and several transformations are available to apply to the landed source data. The source and target databases must be MySQL databases connected via LAN within the California State University, Sacramento campus.

This project has helped me understand the basics of an ETL tool's internal working. It also helped me learn new languages like PHP and Korn shell scripting, and I am thankful that I got an opportunity to work with the MySQL database.

6.1 Future enhancements

There are a few limitations in this project which can be addressed in the future to enhance the ETL tool and courseware. The first limitation is the limited number of source and target connectors; currently the tool has flat file and MySQL table connectors, and Oracle database, MS SQL Server database and XML file connectors could be added. The second enhancement would be to add more transformations, such as SCD, change-capture and change-apply stages, to the ETL tool.


BIBLIOGRAPHY

[1] W. H. Inmon, "Building the Data Warehouse," Fourth Edition.

[2] Jack E. Olson, "Data Quality: The Accuracy Dimension."

[3] Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling," Second Edition.

[4] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data Warehouse Design: Relational and Dimensional Techniques."

[5] Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data."

[6] Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications."

[7] Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence."

[8] Wikipedia, General Information about Data Warehouse, [Online]. Available: http://en.wikipedia.org/wiki/Data_warehouse

[9] MySQL, The INFORMATION_SCHEMA COLUMNS Table, [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/columns-table.html

[10] MySQL, Overview of MySQL, [Online]. Available: http://dev.mysql.com/doc/refman/5.1/en/what-is-mysql.html

[11] MySQL, What Is New in MySQL 5.1, [Online]. Available: http://dev.mysql.com/doc/refman/5.1/en/mysql-nutshell.html

[12] PHP, PHP Features, [Online]. Available: http://php.net/manual/en/features.php

[13] Wikipedia, Korn Shell, [Online]. Available: http://en.wikipedia.org/wiki/Korn_shell

[14] Wikipedia, Comparison of Computer Shells, [Online]. Available: http://en.wikipedia.org/wiki/Comparison_of_computer_shells


[12] PHP, PHP features, [Online]

Available: http://php.net/manual/en/features.php

[13] Wikipedia, Korn Shell, [Online]

Available: http://en.wikipedia.org/wiki/Korn_shell

[14] Wikipedia, Comparison of Computer shells, [Online]

Available: http://en.wikipedia.org/wiki/Comparison_of_computer_shells