SimpleETL: ETL Processing by Simple Specifications
Ove Andersen (Aalborg University & FlexDanmark, Denmark)
Christian Thomsen (Aalborg University, Denmark)
Kristian Torp (Aalborg University, Denmark)

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC BY-NC-ND 4.0.

ABSTRACT

Massive quantities of data are today collected from many sources. However, it is often labor-intensive to handle and integrate these data sources into a data warehouse. Further, the complexity increases when specific requirements exist. One such new requirement is the right to be forgotten, where an organization upon request must delete all data about an individual. Another requirement is that facts can be updated retrospectively. In this paper, we present the general framework SimpleETL, which is currently used for Extract-Transform-Load (ETL) processing in a company with such requirements. SimpleETL automatically handles all database interactions such as creating fact tables, dimensions, and foreign keys. The framework also has features for handling version management of facts and implements four different methods for handling deleted facts. The framework enables, e.g., data scientists to program complete and complex ETL solutions very efficiently with only a few lines of code, which is demonstrated with a real-world example.

1 INTRODUCTION

Data is being collected at unprecedented speed, partly due to cheaper sensor technology and inexpensive communication. Companies have realized that detailed data is valuable because it can provide up-to-date and accurate information on how the business is doing. These developments have in recent years coined terms such as "Big Data", "The five V's", and "Data Scientist". It is, however, not enough to collect data; it should also be possible for the data scientist¹ to integrate it with existing data and to analyze it.

A data warehouse is often used for storing large quantities of data, possibly integrated from many sources. A wide range of Extract-Transform-Load (ETL) tools support cleaning, structuring, and integration of data. The available ETL tools offer many advanced features, which make them very powerful but also both overwhelming and sometimes rigid in their use. It can thus be challenging for a data scientist to quickly add a new data source. Further, many of these products mainly focus on data processing and less on aspects such as database schema handling. Other important topics are the privacy and anonymity concerns of citizens, which have caused the EU (and others) to introduce regulations where citizens have a right to be forgotten [9]. Violating these regulations can lead to large penalties, and it is thus important to enable easy removal of an individual citizen's data from a data warehouse.

A simplified real-world example use case is presented by the star schema in Figure 1, where passenger travels carried out by a taxi company are stored. Each travel is a fact stored in a fact table and connected to a vehicle, a customer, and a date dimension. It is common practice that facts are deleted, e.g., if it is discovered that a trip ordered two days ago was not executed after all, then the fact is removed; or that a fact gets updated, e.g., due to late-arriving accounting information. Further, for audit reasons, it is required that changes are tracked, e.g., if a price is updated.

Figure 1: Example Case Star Schema

The presented SimpleETL framework enables data scientists to program an ETL solution in a very efficient and convenient way with only a few lines of code, consisting mainly of specifications of metadata. The framework manages everything behind the scenes: structuring the data warehouse schema, fact tables, dimensions, references, and indexes, as well as data version tracking. This also includes handling changes to facts in line with Kimball's slowly changing dimensions [12]. Processing data using SimpleETL is automatically highly parallelized such that every dimension is handled in its own process and fact table processing is spread across multiple processes.

¹ By "data scientist" we in this paper refer to someone focused on analyzing data and less on the technical aspects of DBMSs, e.g., ETL tools and data warehousing.
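To give a first impression of this style before the details in Sections 4 and 6, the sketch below shows what such a metadata-only specification could look like. It is a hypothetical illustration, not the framework's documented interface: apart from Datatype (cf. Listing 1 in Section 4), the names Dimension, FactTable, add_att, add_dim, add_measure, and runETL are assumptions made for this sketch.

# Hypothetical sketch of a metadata-only SimpleETL specification; only
# Datatype is taken from the paper, all other names are assumed.
from simpleetl import Datatype, Dimension, FactTable, runETL

# A validating data type: anything that is not a year in [1900, 2099]
# is mapped to the error value -1.
yeartype = Datatype('smallint', lambda v: int(v)
                    if str(v).isnumeric() and 1900 <= int(v) <= 2099 else -1)

# Declare a date dimension; the framework would create the table,
# surrogate key, and indexes itself.
datedim = Dimension(schema='dims', table='datedim', key='dateid')
datedim.add_att('year', yeartype)

# Declare the fact table with a reference to the dimension and one measure.
facts = FactTable(schema='facts', table='travels')
facts.add_dim(datedim, 'dateid')
facts.add_measure('price', Datatype('numeric', float))

# Sample source rows; in practice these would come from an extractor.
rows = [{'year': 2018, 'price': 125.0}]

# A single call would then handle schema management, loading, and parallelism.
runETL(facts, datafeeder=rows)

The point is that the user only declares what the warehouse should contain; creating tables, foreign keys, indexes, and worker processes is left to the framework.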
The rest of the paper is structured as follows. First, related work is discussed in Section 2. Then a simple use case is introduced in Section 3, followed by an example implementation in Section 4 showing how a user efficiently programs an ETL flow. In Section 5, the support for fact version management and deletion of facts is described. Section 6 then describes how a data scientist configures and initializes an ETL run, including how the framework operates, along with a real-world use case example. Section 7 concludes the paper and points to directions for future work.

2 RELATED WORK

A survey of ETL processes and technologies is given by [16]. A plethora of ETL tools exists from commercial vendors such as IBM, Informatica, Microsoft, Oracle, and SAP [2–5, 7]. Open-source ETL tools also exist, such as Pentaho Data Integration and Talend [6, 8]. Gartner presents the widely used tools in its Magic Quadrant [10]. With most ETL tools, the user designs the ETL flow in a graphical user interface by means of connecting boxes (representing transformations or operations) with arrows (representing data flows).

Another approach is taken by the tool pygrametl [14], for which it is argued that programmatic ETL, i.e., creating ETL programs by writing code, can be beneficial. With pygrametl, the user programs Python objects for dimension and fact tables to handle insert/update operations on the target data warehouse. SimpleETL, however, hides this complexity from the user and conveniently handles all schema management. Based on the specification of metadata, SimpleETL creates 1) the required SQL to generate or alter the target data warehouse schema; 2) the necessary database actions and pygrametl objects to modify the tables; and 3) processes for parallel execution. SimpleETL provides template code for its supported functionality, e.g., history tracking of changing facts. It is therefore simple and fast for a data scientist to define an ETL flow or add new sources and dimensions, because she does not have to write the code for this but only specify the metadata.
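For contrast, the following is a minimal sketch of the pygrametl style described above, assuming a PostgreSQL target data warehouse in which the tables already exist; the table and column names are illustrative only.

# Minimal pygrametl-style ETL: the user programs table objects and
# handles rows imperatively.
import psycopg2
import pygrametl
from pygrametl.tables import Dimension, FactTable

# Connect to the target data warehouse (connection details are illustrative).
conn = psycopg2.connect(dbname='dw')
cw = pygrametl.ConnectionWrapper(connection=conn)

# A date dimension with a surrogate key and natural-key lookup attributes.
datedim = Dimension(name='datedim', key='dateid',
                    attributes=['year', 'month', 'day'],
                    lookupatts=['year', 'month', 'day'])

# A fact table referencing the dimension and holding a single measure.
facttbl = FactTable(name='travels', keyrefs=['dateid'], measures=['price'])

row = {'year': 2018, 'month': 3, 'day': 26, 'price': 125.0}
row['dateid'] = datedim.ensure(row)  # look up or insert the dimension member
facttbl.insert(row)                  # insert the fact with its dimension key
cw.commit()

Note that pygrametl handles rows, not schema: the dimension and fact tables must already exist in the database. This is exactly the part that SimpleETL's metadata specifications take over.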
Tomingas et al. [15] propose an approach where Apache Velocity templates and user-specified mappings are used and transformed into SQL statements. In contrast, SimpleETL is based on Python, which makes it easy for data scientists to exploit their existing knowledge and to use third-party libraries.

BIAccelerator [13] is another template-based approach for creating ETL flows with Microsoft's SSIS [4], enabling properties to be defined as parameters at runtime. Business Intelligence Markup Language (Biml) [1] is a domain-specific XML-based language for defining SSIS packages (as well as SQL scripts, OLAP cube definitions, and more). The focus of BIAccelerator and Biml/BimlScript is to let the user define templates generating SSIS packages for repetitive tasks, while SimpleETL makes it easy to create and load a data warehouse based on the templating provided by the framework.

The SimpleETL framework is organized around the UML class diagram in Figure 2, which shows all components in the framework. Next, each component is described in more detail based on the use case from Section 3.

4.1 Class Diagram

Figure 2: UML Class Diagram for SimpleETL

Figure 2 shows the UML class diagram for the SimpleETL framework, which consists of three classes. The class Datatype is used to define the type of a column in both dimension tables and (measures in) fact tables.

Listing 1: Defining Year Data Type

from simpleetl import Datatype

def _validate_year(val):
    # Accept integral values in [1900, 2099]; map anything else to -1.
    if str(val).isnumeric() and 1900 <= int(val) <= 2099:
        return int(val)
    return -1

yeartype = Datatype('smallint', _validate_year)
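The validator in Listing 1 maps anything that is not an integral year in [1900, 2099] to the error value -1. The following small illustration shows the intended behavior (hypothetical standalone usage; within the framework, the function would be applied to every incoming value of a column typed as yeartype):

# Assuming _validate_year from Listing 1:
print(_validate_year(1975))    # 1975  (accepted)
print(_validate_year('2050'))  # 2050  (numeric strings are accepted too)
print(_validate_year('20xx'))  # -1    (non-numeric input is rejected)
print(_validate_year(1899))    # -1    (outside the allowed range)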