
MASARYK UNIVERSITY

FACULTY OF INFORMATICS


File Compression and Decompression in CloverETL

BACHELOR’S THESIS

Sebastián Lazoň

Brno, spring 2014

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Sebastián Lazoň

Advisor: doc. RNDr. Tomáš Pitner, Ph.D.

Acknowledgement

I would like to express my thanks to Javlin’s employees, especially Mgr. Jan Sedláček, for their time, assistance and feedback on problems throughout the development of the project. I would also like to thank doc. RNDr. Tomáš Pitner, Ph.D. for valuable advice on the text of the thesis.

Abstract

The aim of the thesis was to create a set of components for compression, decompression and manipulation of compressed file archives in CloverETL. The first part of the thesis provides an overview of ETL processes, an introduction to CloverETL, the implemented archive formats and the external libraries used; the second part covers the design, implementation and testing of the developed components.

Keywords

Java, CloverETL, compression, decompression

Contents

1 Introduction
 1.1 Motivation
 1.2 Purpose
 1.3 Structure
2 ETL
 2.1 In general
  2.1.1 Extract
  2.1.2 Transform
  2.1.3 Load
 2.2 CloverETL
  2.2.1 Transformation graph
   2.2.1.1 Components
   2.2.1.2 Edges
   2.2.1.3 Sequences
   2.2.1.4 Lookup tables
3 Data compression
 3.1 In general
  3.1.1 Lossy
  3.1.2 Lossless
   3.1.2.1 ZIP
   3.1.2.2 TAR
   3.1.2.3 GZIP
   3.1.2.4 The DEFLATE algorithm
 3.2 In Java
  3.2.1 The java.util.zip package
  3.2.2 Apache Commons Compress
  3.2.3 TrueZIP
4 Analysis
 4.1 File operation components
  4.1.1 Common attributes of Compressed and File Operation
 4.2 Requirements
  4.2.1 Supported URIs
5 Design
 5.1 Project architecture
  5.1.1 Components
  5.1.2 CompressedFileManager
  5.1.3 CompressedOperationHandler
   5.1.3.1 Resolve
   5.1.3.2 List
   5.1.3.3 Delete
   5.1.3.4 Copy/Move
   5.1.3.5 Compress
   5.1.3.6 Decompress
   5.1.3.7 Other methods
  5.1.4 ArchiveInfo
  5.1.5 CompressedUtil
 5.2 Components attributes
  5.2.1 Input mapping
  5.2.2 Output mapping
  5.2.3 Error mapping
6 Implementation
 6.1 Used external libraries
 6.2 Integration with CloverETL
  6.2.1 Integration with Engine
  6.2.2 Integration with Designer
7 Testing and documentation
 7.1 Graph tests
 7.2 Unit tests
 7.3 Documentation
8 Conclusion
 8.1 Further extension of functionality
 8.2 Summary

1 Introduction

Information is an essential part of every enterprise, whether as a subject of its business or, after analysis, as a source of insight into its functioning and an aid to its management. First, the information stored in enterprise systems has to be extracted and processed. But once we realize that these data can be stored in different repositories, platforms and applications, we find that a specialized tool is needed. ETL, shorthand for extract, transform, load, denotes the tools which provide this functionality. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format. [2]

1.1 Motivation

Javlin’s CloverETL is one of these tools. CloverETL is a group of multi-platform Java-based software products implementing ETL processes. It currently supports reading and writing of compressed data, but it has not been able to access and manipulate the content of a compressed archive. When writing data, it is often more efficient to create the files uncompressed and compress them afterwards.

1.2 Purpose

The purpose of this thesis is to create a set of components for compression, decompression and manipulation of compressed file archives in CloverETL. The components’ interface has to be similar to the existing components of the FileOperation category, which work with uncompressed files. Their future extension with new compression formats should be as simple as possible. The new component category, CompressedFileOperations, consists of these components:

ListCompressedFiles provides content listing of archives

DeleteCompressedFiles removes entries from archives

CopyCompressedFiles copies entries from one archive to another

MoveCompressedFiles moves entries from one archive to another

CompressFiles creates new archive or adds files to existing


DecompressFiles decompresses archive entries to selected location

User and developer documentation are also part of the thesis.

1.3 Structure

The thesis is divided into eight chapters. The second is dedicated to an introduction to the field of ETL tools, a presentation of CloverETL and an explanation of the basic principles of how it works. The third discusses compression methods and algorithms in general, compression support in Java and the functionality provided by the external libraries. The following parts describe the analysis and design of the components, their implementation, testing and documentation. In the conclusion, options for further development of the created components’ functionality are presented.

2 ETL

2.1 In general

The term ETL is an essential part of data warehousing¹ and denotes the process of extracting data from a data source, transforming the data to fit operational needs and loading them into a target location. After the data are collected from multiple sources (extraction), they are reformatted and cleansed for operational needs (transformation). Most of the numerous extraction and transformation tools also enable loading of the data into the target location, typically a database, a data warehouse or a data mart, where they are analyzed, allowing developers to create applications and support users’ decisions. [2] Besides data warehousing and business intelligence, ETL tools can also be used to move data from one operational system to another. [11]

2.1.1 Extract

The extract step covers the data extraction from the source system and makes the data accessible for further processing. Usually, data are retrieved from different source systems. These systems may use a different data organization or format, so the extraction must convert the data into a format suitable for transformation processing. [11] This process should use as few resources as possible; it should be designed in a way that does not negatively affect the source system in terms of performance, response time or any kind of locking. [12]

2.1.2 Transform

The transform stage of an ETL process involves the application of a series of rules or functions to the extracted data. It includes the validation of records and their rejection if they are not acceptable, as well as the integration part. While some data sources require very little or even no manipulation of the data, others may require one or more transformations to meet the business and technical requirements of the target database. These transformations can include:

• conversion

• clearing of the duplicates

1. a database used for reporting and data analysis


• standardizing

• filtering and sorting

• translating

• looking up or verifying if the data sources are inconsistent

A good ETL tool must enable building up complex processes and extending its tool library so that custom user functions can be added. [11, 12]

2.1.3 Load

Loading is the last stage of the ETL process; it loads the extracted and transformed data into a target repository. Specialized proprietary technologies for effective and optimal data storage are often used. [11]

2.2 CloverETL

CloverETL is a family of multi-platform software products implementing ETL processes, created in Java. It consists of these products: [7]

CloverETL Engine is the base member of the family. It is a run-time layer that executes transformation graphs created in CloverETL Designer. The Engine is a stand-alone Java library which can be embedded into other Java applications.

CloverETL Designer is a powerful Java-based standalone application for data extraction, transformation and loading, built upon the extensible Eclipse platform. It allows users to create ETL transformations in a user-friendly way, either locally or remotely on a server via CloverETL Server.

CloverETL Server is fully integrated with the Designer and allows running ETL processes in a server environment, where scheduling, parallel execution of graphs and load balancing can be achieved.

2.2.1 Transformation graph

A transformation graph is a directed acyclic graph and has to contain at least one node. Nodes represent components and are the most important part of the graph, while the edges connecting them behave as data channels. A few other elements can be found in a transformation graph, such as sequences, database connections and lookup tables.


Each graph is also divided into a number of smaller units called phases. Every graph contains at least one phase, every node belongs to exactly one phase, and during graph execution the phases are executed sequentially.

2.2.1.1 Components

As mentioned before, components are the most important graph elements. Typically, each component executes a single data transformation. Most components have ports through which they can receive data and/or send the processed data out, and most of them work only when edges are connected to these ports. Each edge in a graph connected to some port must have metadata assigned to it. Metadata describe the structure of the data flowing through the edge from one component to another. Each component can also contain various attributes which change the way the component works. The components can be split into several groups according to their scope: [7]

Readers These components are usually the initial elements of a graph. Their job is to read data from data sources and transform them into individual records.

Writers On the contrary, writers are most of the time located at the end of a graph. They format the records from the input port and write them to a target location.

Transformers Transformers transform the structure of the records from the input port and send them to the output port.

Joiners Joiners join a number of input records into a single output record based on a defined key.

Data quality This group of components performs various tasks related to quality of data - determining information about the data, finding and fixing problems etc.

File Operations The group of components designed for file system manipulation.

2.2.1.2 Edges

An edge is a directed flowline between two components. Each edge has associated metadata and is used as a channel for records between these

components. The structure of the records corresponds to the metadata and cannot be changed during transfer. The order of the records is also unaltered.

2.2.1.3 Sequences

Sequences are used for generating sequences of numbers and are primarily used for unique IDs of records. If needed, they can also preserve their state between individual graph executions.

2.2.1.4 Lookup tables

Lookup tables are data structures designed for storing records and looking them up under a defined ID. Each stored record must correspond to the lookup table’s metadata.

3 Data compression

3.1 In general

Data compression, formally source coding, is the process of reducing the size of data by removing redundant information. This is desirable because it reduces the amount of data we have to store, process or transmit. However, this reduction is not free: it requires extra data processing or data loss¹, often both. Reducing redundancy also makes data less reliable and more prone to errors, therefore data integrity is protected by adding check and parity bits. [1] There are two types of compression, lossy and lossless, differing in the amount of lost information.

3.1.1 Lossy

Lossy compression methods minimize the size of data by selectively discarding the least significant data. They are most commonly used for multimedia data - data intended for human interpretation - because human perception is not perfect and will not notice the difference, or can even fill in missing information to some extent. The ultimate goal is to provide the same perception as the original while removing as much data as possible. Lossy compression suffers from generation loss: repeatedly compressing and decompressing a file causes it to progressively lose quality, therefore the uncompressed original has to be used for editing purposes.

3.1.2 Lossless

In situations where the loss of even a single bit is unacceptable, lossless compression methods are used. These methods remove only statistical redundancy, therefore they cannot be as effective as lossy methods, but the compressed data can be reconstructed into an identical copy of the original. Most lossless compression programs do two things in sequence: the first step generates a statistical model for the input data, and the second step uses this model to map the input data to bit sequences in such a way that frequent data produce shorter output than infrequent data. There are two primary ways of producing statistical models: in static modeling, the data is first analyzed

1. In case of lossy data compression

and then the model is constructed, while adaptive modeling creates the model dynamically as the data is compressed.

3.1.2.1 ZIP

ZIP is one of the most commonly used archive formats with support for file compression. Originally created by Philip Katz, it was first implemented in PKWARE, Inc.’s PKZIP utility. It allows the contained files to be compressed by many different methods, primarily the DEFLATE algorithm. Files can also be stored without compression, and since each file is stored separately, each entry can be compressed using a different method. Data encryption is also supported: initially a simple password-based symmetric system, which is known to be seriously flawed; later an AES-based standard was developed by WinZip. Many vendors use other formats such as DES or certificate-based encryption. [1]

As shown in Figure 3.1, a ZIP file is identified by the presence of the central directory record at the end of the file, which holds information about each entry inside the archive. This metadata consists of the entry name along with other data like entry size, compression method, modification time, etc., and an offset pointing inside the ZIP file to where a copy of the entry header (kept for the sake of redundancy) is located, followed by “extra” data fields and the actual entry data. The “extra” data provide space for extensibility of the ZIP format, such as the ZIP64 extension, which allows ZIP to handle files bigger than 4 GB, AES encryption, file attributes, etc. This internal structure allows us to access each entry individually, without decompressing the whole archive. [1][6]


Figure 3.1: ZIP archive structure
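The per-entry access that the central directory makes possible can be sketched with the standard java.util.zip API (the file and entry names below are arbitrary examples, not taken from the project):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipRandomAccess {

    // Write a small two-entry archive to a temporary file.
    public static File createArchive() throws Exception {
        File f = File.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f))) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return f;
    }

    // ZipFile locates the entry via the central directory at the end of the
    // file, so only the requested entry is decompressed.
    public static String readEntry(File archive, String name) throws Exception {
        try (ZipFile zf = new ZipFile(archive)) {
            ZipEntry e = zf.getEntry(name);
            return new String(zf.getInputStream(e).readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        File f = createArchive();
        System.out.println(readEntry(f, "b.txt")); // beta
        f.delete();
    }
}
```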


3.1.2.2 TAR

TAR is an archive format created in the early days of Unix, originally developed for magnetic tape, where data are accessed sequentially. It therefore lacks a centralized location for the content of the archive (unlike ZIP’s central directory) and does not support random access to its entries. Although TAR does not support data compression, it is often compressed by a stand-alone compression format like GZIP, BZIP2, XZ, etc. The compressed file then gets its name by appending the compressor’s format-specific suffix to the original name.² A TAR archive (Figure 3.2) consists of a series of file entries terminated by an end-of-archive entry, which consists of two 512-byte blocks of zero bytes. Each file entry consists of a header, containing the entry name, statistics and a checksum, and the content of the file. [10]


Figure 3.2: TAR archive structure
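The sequential layout can be made concrete with a minimal, illustrative ustar-style writer and reader in plain Java (the field offsets follow the POSIX header layout; this sketch skips checksum verification and many real-world details):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TarSketch {

    // Write an octal ASCII number into a fixed-width, NUL-terminated field.
    static void writeOctal(byte[] buf, int off, int len, long val) {
        String s = Long.toOctalString(val);
        StringBuilder padded = new StringBuilder();
        for (int i = s.length(); i < len - 1; i++) padded.append('0');
        padded.append(s);
        byte[] b = padded.toString().getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, buf, off, b.length);
    }

    // One file entry: a 512-byte header followed by the content padded to 512 bytes.
    public static byte[] entry(String name, byte[] content) {
        byte[] header = new byte[512];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, header, 0, n.length);  // name, offset 0
        writeOctal(header, 100, 8, 0644);             // mode
        writeOctal(header, 108, 8, 0);                // uid
        writeOctal(header, 116, 8, 0);                // gid
        writeOctal(header, 124, 12, content.length);  // size, offset 124
        writeOctal(header, 136, 12, 0);               // mtime
        header[156] = '0';                            // typeflag: regular file
        Arrays.fill(header, 148, 156, (byte) ' ');    // checksum field counts as spaces
        int sum = 0;
        for (byte b : header) sum += b & 0xff;
        writeOctal(header, 148, 8, sum);              // checksum, offset 148
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, 512);
        out.write(content, 0, content.length);
        int pad = (512 - content.length % 512) % 512;
        out.write(new byte[pad], 0, pad);
        return out.toByteArray();
    }

    // Sequential scan: read each header, then skip over the padded content.
    public static List<String> list(byte[] tar) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (pos + 512 <= tar.length && tar[pos] != 0) { // a zero block ends the archive
            int end = pos;
            while (end < pos + 100 && tar[end] != 0) end++;
            names.add(new String(tar, pos, end - pos, StandardCharsets.US_ASCII));
            long size = Long.parseLong(
                    new String(tar, pos + 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            pos += 512 + (int) ((size + 511) / 512) * 512;
        }
        return names;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        byte[] a = entry("a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        byte[] b = entry("b.txt", "world!".getBytes(StandardCharsets.US_ASCII));
        archive.write(a, 0, a.length);
        archive.write(b, 0, b.length);
        archive.write(new byte[1024], 0, 1024); // end of archive: two 512-byte zero blocks
        System.out.println(list(archive.toByteArray())); // [a.txt, b.txt]
    }
}
```

Note that listing the archive requires walking every header in order; there is no index to jump to, which is exactly why TAR offers no random access.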

3.1.2.3 GZIP

The GZIP file format is based on the DEFLATE algorithm and is often used in combination with TAR to compress archived data. Although GZIP supports concatenating multiple streams and thus compressing multiple files, most of the time it is used to compress a single file. A GZIP file, as shown in Figure 3.3, consists of:

• a 10-byte header including a magic number³, version number and timestamp,

• optional extra headers with additional information,

• the DEFLATE-compressed original data,

• an 8-byte footer containing a CRC⁴ checksum and the length of the original uncompressed data. [4]

2. For example: file.tar to file.tar.gz
3. A number embedded at or near the beginning of a file that indicates its file format
4. Cyclic redundancy check



Figure 3.3: GZIP file structure
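The header and footer fields described above can be observed directly on the output of java.util.zip.GZIPOutputStream (a minimal sketch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipLayout {

    public static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // The last four footer bytes store the original length modulo 2^32, little-endian.
    public static long originalLength(byte[] gzipped) {
        int n = gzipped.length;
        return (gzipped[n - 4] & 0xffL)
             | (gzipped[n - 3] & 0xffL) << 8
             | (gzipped[n - 2] & 0xffL) << 16
             | (gzipped[n - 1] & 0xffL) << 24;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] out = gzip(data);
        // The header starts with the magic number 0x1f 0x8b, then method 8 (DEFLATE).
        System.out.printf("magic: %02x %02x, method: %d%n",
                out[0] & 0xff, out[1] & 0xff, out[2]);
        System.out.println("footer length field: " + originalLength(out));
    }
}
```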

3.1.2.4 The DEFLATE algorithm

DEFLATE is a data compression algorithm designed by Philip Katz, first implemented as a part of the ZIP file format in PKZIP. Since then, it has been used in many applications, including the HTTP protocol, PNG⁵ and PDF⁶. DEFLATE is based on a combination of LZ77 and Huffman coding, so compression is achieved in two stages. In the first stage, if a duplicate series of bytes is spotted, a back-reference is inserted, linking to the previous location of that identical string instead. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 kB of uncompressed data decoded⁷. The second stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is Huffman coding, which creates a prefix-free tree of non-overlapping intervals, where the length of each bit sequence is inversely proportional to the probability of that symbol needing to be encoded: the more likely a symbol is to be encoded, the shorter its bit sequence will be. [1]
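Both stages are exposed in Java through the Deflater and Inflater classes; on repetitive input, the LZ77 back-references alone shrink the data dramatically (a minimal sketch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {

    public static byte[] deflate(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    public static byte[] inflate(byte[] compressed, int maxLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] buf = new byte[maxLength];
        int off = 0;
        while (!inf.finished() && off < buf.length) {
            off += inf.inflate(buf, off, buf.length - off);
        }
        inf.end();
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws Exception {
        // Highly repetitive input: back-references replace the repeats.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("abcdefgh");
        byte[] data = sb.toString().getBytes(StandardCharsets.US_ASCII);

        byte[] packed = deflate(data);
        System.out.println(data.length + " -> " + packed.length + " bytes");
        System.out.println(Arrays.equals(inflate(packed, data.length), data)); // true
    }
}
```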

3.2 In Java

This section provides basic information about data compression support in Java and used external libraries.

3.2.1 The java.util.zip package

Java has provided classes for reading and writing the standard ZIP and GZIP file formats since JDK 1.1. The package also includes classes for compressing and decompressing data using the DEFLATE compression algorithm, which is used

5. Portable Network Graphics
6. Adobe’s Portable Document Format
7. Also called the sliding window

by the ZIP and GZIP file formats. Additionally, there are utility classes for computing the CRC-32 and Adler-32 checksums of arbitrary input streams. [3]
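For instance, the checksum utilities can be exercised with the well-known CRC-32 check value of the ASCII string "123456789" (a minimal sketch):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ChecksumDemo {

    public static long crc32(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static long adler32(byte[] data) {
        Adler32 a = new Adler32();
        a.update(data, 0, data.length);
        return a.getValue();
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xCBF43926 is the standard CRC-32 check value for this input.
        System.out.printf("CRC-32:   %08x%n", crc32(check));
        System.out.printf("Adler-32: %08x%n", adler32(check));
    }
}
```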

3.2.2 Apache Commons Compress

The Apache Commons Compress library defines an API for working with ar, cpio, Unix dump, tar, zip, gzip, XZ, Pack200, bzip2, 7z, arj, lzma and Z files. [8] The Compress component is split into compressors, which process single file streams, and archivers, which process archives containing ArchiveEntry instances representing files and directories. Of the required formats, the gzip support is provided by the java.util.zip package, while tar and zip use a custom implementation providing capabilities that go beyond the features of java.util.zip. The Compress component provides factories that can be used to choose an implementation by algorithm name; in the case of an input stream, it can also guess the format and provide the matching implementation.

3.2.3 TrueZIP

TrueZIP is a Java-based virtual file system (VFS) which enables client applications to perform CRUD (Create, Read, Update, Delete) operations on archive files as if they were virtual directories. [9] TrueZIP uses a three-tier architecture:

Access tier Since TrueZIP 7.2, two client API⁸ modules are available which can be used to access virtual file systems. TrueZIP File requires Java SE 6 and provides classes which can be used the same way as the java.io.File* classes. TrueZIP Path requires Java SE 7 because it provides the TFileSystemProvider class to implement a file system provider for the NIO.2 API (JSR 203).

Kernel tier The TrueZIP Kernel module implements virtual file systems, manages their state and commits unsynchronized changes if required or requested. It uses file system drivers to access these resources and provides federating, multithreading, multiplexing, caching and accounting, so that archive file system drivers do not need to take care of these aspects of a virtual file system.

8. Application programming interface


Driver tier TrueZIP supports file systems via its pluggable file system driver architecture. It comes with multiple drivers; specifically, the FILE, ZIP, TAR and TGZ drivers were used in this project. New file system drivers can also be implemented by the user and plugged in.

4 Analysis

This chapter discusses the current group of components, FileOperation, which was used as a template for the implemented components. It also covers the requirements the produced components had to meet.

4.1 File operation components

The FileOperation components allow the user to manipulate the file system’s files and directories. These components support listing of directories, reading file attributes, and creating, copying, moving and deleting files or directories. Files can be accessed either locally or remotely via FTP and HDFS. Limited support for other protocols like HTTP is also included. However, manipulation of the content of archived files is not supported.

4.1.1 Common attributes of Compressed and File Operation

These attributes of File Operation [7] are inherited by Compressed File Operation and have exactly the same function in both groups of components. For the attributes specific to Compressed File Operation, proceed to 5.2.

Input mapping The operation will be executed for each input record. If the input edge is not connected, the operation will be performed exactly once. The attributes of the component may be overridden by the values read from the input port, as specified by the input mapping.

Output mapping Some components contain two output ports, an output port and an optional error port. The output port represents the result record generated by the component.

Error mapping By default, the component will cause the graph to fail if it fails to perform the operation. This can be prevented by connecting the error port. If the error port is connected, the failures will be sent to the error port and the component will continue. The standard output port may also be used for error handling if the Redirect error output option is enabled.

Redirect error output If set to true, errors will be sent to the output port instead of the error port.

Stop processing on fail By default, a failure causes the component to skip all subsequent operations and send the information about the skipped


operations to the error output port. This behavior can be turned off by this attribute.

Verbose output If enabled, one input record may cause multiple records to be sent to the output (e.g. as a result of wildcard expansion). Otherwise, each input record will yield just one cumulative output record.

4.2 Requirements

The inability of FileOperation to handle archive files should be solved by creating a new component group, CompressedFileOperation, which will provide functionality similar to FileOperation but on the entries of archived files. A total of six components will be created: ListCompressedFiles, DeleteCompressedFiles, CopyCompressedFiles and MoveCompressedFiles, whose functionality will be derived from their FileOperation counterparts, and two new components, CompressFiles and DecompressFiles, which will allow adding new files to archives and decompressing them. At first, only the most common archive formats shall be supported: .zip, .tar and .tar.gz, but the possibility of further extension should be kept in mind during design. Developer and user documentation is also part of the requirements and of this thesis. The next section lists examples of the required URIs¹ with descriptions.

4.2.1 Supported URIs

All local and remote URIs supported by FileOperation.

zip:(/path/archive.zip) equals /path/archive.zip when not explicitly said otherwise.

zip:(/path/archive.zip)#entry/file.txt A single entry inside an archive.

zip:(tgz:(/path/archive.tar.gz)#archive.zip)#dir/file.txt A single entry inside a nested archive.

zip:(tgz:(/path/*.tar.gz)#archive.zip)#entry?/file.* Wildcards² are supported, but they may be used only in the outer compressed file, innermost folder and innermost file names. They cannot be used in the inner folder and inner zip file names.

1. Uniform resource identifier
2. A special symbol in an address (? or *) which can be replaced by exactly one character, or by zero or more characters, respectively

5 Design

This chapter describes the structure of the implemented components, explains their integration and shows how the archive data are processed.

5.1 Project architecture

The project consists of two separate parts: the engine plugin, located at cloveretl.component.compress, and the designer plugin, at clover.gui.compress. Figure 5.1 shows the class diagram of the engine plugin, followed by a description of the contained classes.

[Class diagram showing CompressedFileOperationHandler (list, delete, copy, move, compress, decompress), the CompressedUtils utility class, CompressedFileManager, the ArchiveEntryInfo hierarchy with ZipArchiveEntryInfo and TarArchiveEntryInfo, and the Component classes that invoke the manager’s operations.]

Figure 5.1: CompressedFileOperation class diagram

5.1.1 Components

Each of the implemented components has its own definition class. This class has to be derived from the org.jetel.graph.Node class, which describes the basic component operations that have to be implemented in order to run properly. In the case of CompressedOperation and FileOperation, AbstractFileOperation is added in between and takes care of the operations they have in common, like input/output mapping, common parameters, etc.


The structure of a component is inherited from Node and contains these methods:

checkConfig() checks the component’s configuration and makes sure every required attribute is set.

init() creates the input/output/error mapping.

preExecute() prepares the component for the execution of the operation.

executeOperation() calls the operation method on an instance of CompressedFileManager and passes the operation’s parameters and source/target URIs as strings.

processInput/Output/Error();

postExecute() performs the necessary operations after the execution of the operation.

fail() performs the necessary operations, like resource freeing, in case of a fatal error.

5.1.2 CompressedFileManager

CompressedFileManager replaces FileManager, although their functionality is not very different. According to the requirements, alteration of the existing code was not allowed, so a new manager was created. CompressedFileManager transforms strings to CloverURIs¹ and performs basic URI validation. However, these addresses do not always refer to a single file. They can contain wildcards and even more CloverSingleURIs², so the original implementation resolved these addresses, and only SingleCloverURIs with resolved wildcards were passed to the corresponding operation handler. While this approach was completely valid when dealing with ordinary files and directories, archives require a different approach. Let us imagine we want to delete every text file from the root of a ZIP archive. The SingleCloverURI zip:(C:\archive.zip)#*.txt is created in FileManager and resolved to

zip:(C:\archive.zip)#1.txt
zip:(C:\archive.zip)#2.txt
zip:(C:\archive.zip)#3.txt

1. The CloverETL implementation of a URI
2. A CloverURI which contains exactly one physical address


Each URI is passed separately to CompressedOperationHandler for processing, resulting in the archive being opened, searched and altered three times, which is suboptimal. Instead, the CloverURI is passed to CompressedOperationHandler as is, and the resolution takes place there.

5.1.3 CompressedOperationHandler

An OperationHandler is a class which executes a defined operation on different repositories and implements the IOperationHandler interface. One of these classes is CompressedOperationHandler, which resolves and executes operations on archives. There are several methods an OperationHandler has to implement, and CompressedOperationHandler adds two new ones: compress and decompress. Here is their list with a description of their functionality:

5.1.3.1 Resolve

• Resolve receives a SingleCloverURI as a parameter and recursively processes the given URI. In each iteration, if a supported archive URI scheme³ is present, the URI fragment⁴ is added to a list and the rest is passed to the next iteration.

• If the scheme does not indicate an archive, the original FileManager is called to find an operation handler and resolve it.

• The list of resolved URIs, the operation handler and the list of fragments are then returned.

For example, when the resolve method is called with the argument zip:(tgz:(C:\archive?.tar.gz)#archive.zip)#dir/*.txt, the first iteration extracts the URI’s scheme zip, which is a recognized archive scheme, adds dir/*.txt to the fragments and passes on tgz:(C:\archive?.tar.gz)#archive.zip. Similarly, the scheme tgz is recognized, archive.zip is added to the fragments and C:\archive?.tar.gz is passed on.

• When an empty scheme is received in the next step, FileManager finds a suitable handler, in our case LocalOperationHandler, which executes operations on local files and finds the files that satisfy the given URI.

3. The top level of a URI; describes the context in which the rest of the address is interpreted
4. An optional part of a URI; provides a path to a secondary resource, e.g. a section in an article or an entry in an archive
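The recursive peeling described above can be illustrated with a simplified, hypothetical stand-in (this is not the actual CloverETL code; the regular expression and the scheme list are assumptions made for the sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UriPeeling {

    // Matches e.g. "zip:(inner)#fragment": an assumed archive scheme,
    // a parenthesised inner URI and an entry fragment.
    private static final Pattern ARCHIVE =
            Pattern.compile("^(zip|tgz|tar|gz):\\((.*)\\)#(.*)$");

    // Peels archive schemes off the URI, collecting fragments outermost-first;
    // the last element is the remaining physical file path.
    public static List<String> peel(String uri) {
        List<String> fragments = new ArrayList<>();
        Matcher m;
        while ((m = ARCHIVE.matcher(uri)).matches()) {
            fragments.add(m.group(3)); // entry path inside this archive level
            uri = m.group(2);          // descend into the inner URI
        }
        fragments.add(uri);
        return fragments;
    }

    public static void main(String[] args) {
        System.out.println(peel("zip:(tgz:(C:\\archive.tar.gz)#archive.zip)#dir/*.txt"));
    }
}
```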


5.1.3.2 List

• After resolving the URIs, the corresponding resolved handler is asked for an input stream for every resolved URI.

• If successful, the inverse of the resolving process is applied. If the fragments list is not empty, the scheme and fragment are appended to the base URI and passed as a parameter to the next iteration.

• If wildcards are detected in a fragment, the fragment is converted to a regular expression by the wildcardToRegex method and each of the entries is then tested for a match.

• If the fragments list is empty, it means we have reached the required file/directory/archive, and we either return an instance of the ArchiveEntryInfo class or, in the case of a directory or archive, continue processing, depending on the component’s parameters.
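A wildcardToRegex-style conversion might look like the following hypothetical sketch (the real CompressedUtil implementation may differ):

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    // Translate a wildcard pattern into a regular expression:
    // '*' matches zero or more characters, '?' matches exactly one.
    public static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': sb.append(".*"); break;
                case '?': sb.append('.'); break;
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(wildcardToRegex("*.txt").matcher("notes.txt").matches());  // true
        System.out.println(wildcardToRegex("file?.txt").matcher("file1.txt").matches()); // true
    }
}
```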

5.1.3.3 Delete

Delete works similarly to list, but since it alters the archive, the archive has to be accessible locally; if the URI refers to a remote file, it has to be downloaded first.

5.1.3.4 Copy/Move

Copy and move work similarly to delete, but since they work with several archives at a time (at least two: a single source and a target), the process has to be repeated several times.

5.1.3.5 Compress

Compress can be divided into two use cases. If we want to add files to an existing archive, copy is called with adapted parameters. If we want to create a new archive, a separate compress implementation is called, which reads the input files and writes them to a temporary archive; the temporary archive is then moved to the destination defined by the target URI.

5.1.3.6 Decompress

Decompress is implemented as copying from an archive to a regular file or directory.


5.1.3.7 Other methods

getInput, getOutput and getFile return an archive input/output stream or a file, if possible.

5.1.4 ArchiveInfo

The purpose of this abstract class is to provide general information about an entry of an archive. The class has to be extended by an archive-specific child class to also provide information that depends on the entry type, such as ZipArchiveEntry or TarArchiveEntry. The provided information is defined by the Info interface and is described in detail in 5.2.2.

5.1.5 CompressedUtil

CompressedUtil is a utility class which provides functionality for both CompressedOperationHandler and ArchiveInfo. It offers methods for wildcard handling such as isWildcard and toWildcard, operations on URIs such as createNewURI and getScheme, and contains every archive-type-dependent method used in this project.
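Two of these helpers might be shaped roughly like this; only the method names isWildcard and getScheme come from the text, the bodies and the isArchiveScheme helper are hypothetical:

```java
import java.util.Set;

// Hypothetical sketch of CompressedUtil-style helpers; not the thesis code.
public class UtilSketch {

    private static final Set<String> ARCHIVE_SCHEMES = Set.of("zip", "tar", "tgz", "gz");

    /** Returns the leading archive scheme of a URI such as "zip:(...)#x", or null. */
    public static String getScheme(String uri) {
        int i = uri.indexOf(":(");
        return i > 0 ? uri.substring(0, i) : null;
    }

    public static boolean isArchiveScheme(String uri) {
        String s = getScheme(uri);
        return s != null && ARCHIVE_SCHEMES.contains(s);
    }

    /** True if the path contains the ? or * wildcard characters. */
    public static boolean isWildcard(String path) {
        return path.indexOf('*') >= 0 || path.indexOf('?') >= 0;
    }
}
```

Keying on the two-character sequence ":(" rather than ":" alone means a Windows path like C:/archive.zip is not mistaken for a scheme.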

5.2 Components attributes

This section covers the input, result and error attributes of the implemented CompressedFileOperation components, with short descriptions. These attributes are to a large extent derived from the attributes of File Operation. For a description of the common attributes see 4.1.1. A complete component reference can be found in the attached user documentation.

5.2.1 Input mapping

List/DeleteCompressedFiles

File URL – path to the archive file (4.2.1).
Recursive – if set to true, directories are listed/deleted recursively.

Copy/Move/DecompressFiles

Source URL – path to the archive file (4.2.1).
Target URL – Copy/Move: path to a single archive entry; Decompress: path to a single file or directory.


Recursive – if set to true, directories are copied/moved/decompressed recursively.
Overwrite – specifies whether existing files shall be overwritten.
Create parent directories – if set to true, attempts to create nonexistent target directories.

CompressFiles

Source URL – path to the source file or directory.
Target URL – if Create new archive is set to true, path to the new archive; otherwise, path in an existing archive.
Create new archive – if set to true, a new archive is created; otherwise files are added to an existing one.
Archive type – type of the new archive (ignored when Create new archive is false).
Compression level – level of compression (ignored when Create new archive is false).
Recursive – if set to true, directories are compressed recursively.
Overwrite – specifies whether existing files shall be overwritten.
Create parent directories – if set to true, attempts to create nonexistent target directories.

5.2.2 Output mapping

Common attributes

Result – true if the operation has succeeded; can be false if Redirect error output is set.
Error message – used only when Redirect error output is set.
Stack trace – used only when Redirect error output is set.

ListCompressedFiles

URL – URL of the entry inside the archive.
Name – the entry's name.
Can read – true if the entry can be read; if not set, returns true.
Can write – true if the entry can be written; if not set, returns true.


Can execute – true if the entry can be executed; if not set, returns true.
Is directory – true if the entry is a directory.
Is file – true if the entry is a regular file.
Is hidden – true if the entry is a hidden file.
Last modified – the entry's last modified date.
Size – the entry's compressed size.

DeleteCompressedFiles

File URL – URL of the deleted entry in the archive.

Other components

Source URL – URL of the source file/directory/entry.
Target URL – URL of the target file/directory/entry.
Result URL – URL of the newly created file/directory/entry; only when Verbose output is set.

5.2.3 Error mapping

Common attributes

Result – always set to false.
Error message – the error message.
Stack trace – the stack trace of the error.

DeleteCompressedFiles

File URL – URL of the deleted entry in the archive.

Other components

Source URL – URL of the source file/directory/entry.
Target URL – URL of the target file/directory/entry.

6 Implementation

This chapter contains information about the external libraries and how and where they have been used. It also describes the process of integrating the components with CloverETL.

6.1 Used external libraries

This implementation uses the Apache Commons Compress and TrueZIP libraries for processing archived files. While TrueZIP provides an abstraction for manipulating archived files and saves implementation time, Commons Compress has proved to be more efficient in some cases.

ListCompressedFiles uses Commons Compress because it can retrieve entry information from the archive stream, so the file does not have to be decompressed and available locally. CompressFiles also uses this library, because file compression is not a difficult operation to code and it performed slightly better.

The rest of the components use TrueZIP. Since their operations require the file to be accessible locally, reinventing the wheel using Commons Compress would not gain any performance.
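The streaming approach can be illustrated with the JDK's ZipInputStream; the actual components use the analogous ArchiveInputStream hierarchy from Commons Compress, which works the same way for ZIP, TAR and GZIP streams:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch of stream-based listing: entry metadata is read from the stream
// itself, so the archive need not be decompressed to disk or even be local.
public class ListSketch {

    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName()); // metadata only; entry bytes are skipped
            }
        }
        return names;
    }
}
```

Because only the stream is consumed, the same code works for a local file, a remote download in progress, or a nested archive entry.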

6.2 Integration with CloverETL

In order to function properly with CloverETL Designer, several configuration files had to be created. Here is a brief list of these files and their purpose:

6.2.1 Integration with Engine

plugin.xml has to be created in the engine plugin in order to make the components known to ComponentFactory¹. It contains a list of all used external libraries and all components with the names of their definition classes.

In addition to this, every custom component definition class has to extend the org.jetel.graph.Node abstract class.

1. a Java class that creates components based on component type and XML parameter definition


6.2.2 Integration with Designer

components.xml is the component definition file in the designer plugin; its purpose is to define the components' input and output ports, the parameters which configure each component, and other information necessary for CloverETL Designer.

plugin.xml is a configuration file in the designer plugin. It registers the components with CloverETL Designer.

build.xml is the build file.

7 Testing and documentation

7.1 Graph tests

CloverETL uses graph tests to verify the interaction and functionality of components. Each graph test consists of a graph definition, input data and expected output data, which are compared to the actual data produced by the graph execution. Tests are executed automatically during the build to detect errors as soon as possible. Several graph tests were created for each of the implemented components, starting with basic operations on one archive and ending with nested archives and wildcards.

7.2 Unit tests

In addition to the graph tests, some functionality is tested with the JUnit test framework. These tests focus on essential operations like wildcard and URI handling. Every test class has to extend CloverTestCase, which takes care of engine initialization.

7.3 Documentation

Creating user and developer documentation was also part of the requirements. The user documentation is inspired by the existing component reference and describes the components' parameters and output data structure. The supported URI types as defined in 4.2.1 are mentioned too. The developer documentation describes the content of the component.compress package and the functionality of the components, similarly to 5.1. It will be used to prepare the components for production and for further extension of functionality. In addition, the source code of the components and helper classes is documented with JavaDoc and corresponds with the existing CloverETL documentation.

8 Conclusion

8.1 Further extension of functionality

Expandability of the implemented components was one of the key requirements. The set of supported archive formats is likely to be extended in the future, and the process will depend on whether Apache Commons Compress supports the new formats. One of two situations can happen:

Compress supports the new format: a new class extending ArchiveEntry will be created in the component.compress.entry package, and the archive-dependent methods of CompressedUtil in component.compress.util have to be updated.

Compress does not support the new format: in this case, the newly added library will probably not implement the Compress interface, therefore a new operation handler will have to be created. This new handler will then use CompressedOperationHandler, similarly to how this implementation uses the existing operation handlers to access archive files (5.1.3).

8.2 Summary

The main objective of this thesis was to create a set of components for compression, decompression and manipulation of compressed file archives for CloverETL. The components were integrated with both the CloverETL Engine and Designer. The interface of the components is derived from the existing File Operations group of components, and the new features are simple for the user to understand. The code is properly commented; user and developer documentation were also created and are attached.

Bibliography

[1] SALOMON, David. Data Compression: the complete reference. 4th ed. London: Springer, 2007, xxv, 1092 s. ISBN 978-1-84628-602-5.

[2] KIMBALL, Ralph; CASERTA, Joe. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis: Wiley, 2004. ISBN 07-645-7923-1.

[3] Oracle Corporation. Java SE 7 Documentation. [online]. Available: http://docs.oracle.com/javase/7/docs/

[4] Gzip file format specification version 4.3. [online]. Available: http://www.ietf.org/rfc/rfc1952.txt

[5] DEFLATE Compressed Data Format Specification version 1.3. [online] Available: http://www.ietf.org/rfc/rfc1951.txt

[6] PKWARE Inc. APPNOTE.TXT - .ZIP File Format Specification version 6.3.3. [online] Available: http://www.pkware.com/documents/casestudies/APPNOTE.TXT

[7] Javlin a.s. User’s Guide. [online] Available: doc.cloveretl.com/documentation/UserGuide/

[8] The Apache Software Foundation. Commons Compress - Overview. [online] Available: http://commons.apache.org/proper/commons-compress/

[9] Schlichtherle IT Services. TrueZIP Key Features. [online] Available: https://truezip.java.net/features.html

[10] Free Software Foundation, Inc. GNU tar 1.27: Basic Tar Format. [online] Available: http://www.gnu.org/software/tar/manual/html_node/Standard.

[11] Goli Info. ETL – Extract Transform Load. [online] Available: http://www.etltools.org/

[12] Javlin, a. s. Data Integration Info. [online] Available: http://www.dataintegration.info/etl

Attachments

Electronic archive contents

• cloveretl.component.compress – CloverETL engine plugin classes

• cloveretl.gui.compress – CloverETL designer plugin classes

• cloveretl.test.scenarios – graph tests

• documentation – user and developer documentation

Compressed file operation user documentation

A group of components similar to File Operation, differing in the way it works with archived files. Where File Operation handles archives as standard files, Compressed File Operation allows the user to work with them as if they were directories. This group of components supports listing, deleting, copying, moving and decompressing entries of supported archives (currently .zip, .tar and .tar.gz), and adding files to an existing archive or compressing to a new one.
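The "archive as a directory" idea can be illustrated with the JDK's built-in ZIP file system, shown here only as an analogy; the components themselves use TrueZIP, not this API:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Illustration of treating an archive like a directory, via the JDK zip
// file system provider. The class and method names are made up for this sketch.
public class ZipAsDirSketch {

    public static void writeEntry(Path zip, String entry, byte[] data) throws IOException {
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            Path p = fs.getPath(entry);
            if (p.getParent() != null) {
                Files.createDirectories(p.getParent()); // entry "directories"
            }
            Files.write(p, data); // the entry behaves like an ordinary file
        }
    }
}
```

Inside the mounted file system, ordinary Files operations (list, copy, delete) work on archive entries exactly as the components described below let the user do.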

List of components

ListCompressedFiles – provides content listing of archives
DeleteCompressedFiles – removes entries from archives
CopyCompressedFiles – copies entries from one archive to another
MoveCompressedFiles – moves entries from one archive to another
CompressFiles – creates a new archive or adds files to an existing one
DecompressFiles – decompresses archive entries to a selected location

Component reference – input parameters

Common input parameters

Redirect error output (optional) – if enabled, errors will be sent to the output port instead of the error port. Possible values: true | false
Verbose output (optional) – if enabled, one input record may cause multiple records to be sent to the output. Possible values: true | false
Stop processing on fail (optional) – by default, a failure causes the component to skip all subsequent operations and send the information about skipped operations to the error output port. This behaviour can be turned off by this attribute. Possible values: true | false

List/DeleteCompressedFiles

File URL (required) – path to the file or directory inside an archive
Recursive (optional) – list/delete directories recursively. Possible values: true | false

Copy/MoveCompressedFiles/DecompressFiles

Source file URL (required) – path to an entry inside an archive
Target file URL (required) – Copy/Move: path to the file or directory inside an archive; Decompress: path to the file or directory

Recursive (optional) – copy/move/decompress directories recursively. Possible values: true | false
Overwrite (optional) – whether existing files should be overwritten. Possible values: always | update | never
Create parent directories (optional) – creates nonexistent target directories. Possible values: true | false

CompressFiles

Source file URL (required) – path to the source file or directory
Target file URL (required) – path to the destination file or directory inside an archive; when Create new archive is set to true, represents the target archive
Create new archive (optional) – true: create a new archive; false: add files to an existing archive. Possible values: true | false
Archive type (optional) – type of the target archive (relevant only if Create new archive is set). Possible values: ZIP | TAR | TAR.GZ
Compression level (optional) – level of compression (relevant only if Create new archive is set). Possible values: no compression | fastest compression | default compression | best compression
Recursive (optional) – compress directories recursively. Possible values: true | false
Overwrite (optional) – whether existing files should be overwritten. Possible values: always | update | never
Create parent directories (optional) – creates nonexistent target directories. Possible values: true | false
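The four compression-level values map naturally onto the java.util.zip.Deflater constants; the mapping below is a plausible illustration, not necessarily how the component translates them internally:

```java
import java.util.zip.Deflater;

// Illustrative mapping of the component's compression-level values onto
// java.util.zip.Deflater constants; not the thesis code.
public class LevelSketch {

    public static int toDeflaterLevel(String level) {
        switch (level) {
            case "no compression":      return Deflater.NO_COMPRESSION;      // 0
            case "fastest compression": return Deflater.BEST_SPEED;          // 1
            case "best compression":    return Deflater.BEST_COMPRESSION;    // 9
            case "default compression":
            default:                    return Deflater.DEFAULT_COMPRESSION; // -1
        }
    }
}
```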

Component reference – result

Common result entries

result (boolean) – true if the operation has succeeded (can be false if Redirect error output is set)
errorMessage (string) – used only when Redirect error output is set
stackTrace (string) – used only when Redirect error output is set

ListCompressedFiles

URL (string) – URL of the entry in an archive
name (string) – the entry's name
canRead, canWrite, canExecute (boolean) – entry permissions; if not set, return true
isDirectory (boolean) – true if the entry is a directory
isFile (boolean) – true if the entry is a regular file
isHidden (boolean) – true if the entry is hidden
lastModified (date) – the entry's last modified date
size (long) – the entry's compressed size

DeleteCompressedFiles

fileURL (string) – URL of the deleted entry in the archive

Other components

sourceURL (string) – CompressFiles: URL of the source file or directory; other: URL of the source archive entry
targetURL (string) – DecompressFiles: URL of the target file or directory; other: URL of the target archive entry
resultURL (string) – new URL of the processed file/directory/entry

Component reference – error

sourceURL (string) – URL of the source file/directory/entry
targetURL (string) – URL of the target file/directory/entry
result (boolean) – always set to false
errorMessage (string) – the error message
stackTrace (string) – the stack trace of the error

Supported URL formats

• All URLs supported by File Operation
• zip:(/path/archive.zip) – equals /path/archive.zip
• zip:(/path/archive.zip)#entry/file.txt – entry inside an archive
• zip:(tgz:(/path/archive.tar.gz)#archive.zip)#entry/file.txt – entry inside nested archives
• zip:(tgz:(/path/*.tar.gz)#archive.zip)#entry?/file.* – wildcards (? and *) may be used in the outer compressed file names, the innermost folder and the innermost file names; they cannot be used in the inner folder and inner zip file names
• All supported remote URLs

Compressed file operation developer documentation

CompressedFileOperation is a group of components based on FileOperation, focused on working with archive files. It uses the Apache Commons Compress library for listing and compression, and the TrueZIP library for the other file operations. It currently consists of 6 components: ListCompressedFiles, DeleteCompressedFiles, CopyCompressedFiles, MoveCompressedFiles, CompressFiles and DecompressFiles. You can find it in the com.cloveretl.component.compress package.

com.cloveretl.component.compress package content

compress
– component definition classes (extend AbstractFileOperation), mostly identical to their FileOperation counterparts, with added initialization and removal of a temporary directory; CompressFiles and DecompressFiles are derived from CopyCompressedFiles and adjusted to work with their parameters and results
– CompressedFileManager checks the validity of URIs and makes sure CompressFileOperationHandler is called for each operation
– CompressFileOperationHandler (where the actual archive processing takes place) is one of many operation handlers; it implements the IOperationHandler interface but also adds new methods (compress, decompress) which are not present in the interface
– CompressedFileOperationMessages holds custom error messages which are not contained in FileOperationMessages

compress.entry
– ArchiveEntryInfo is an abstract class which implements the Info interface and provides basic information about an archive entry, such as getName() or isDirectory()
– Zip/Tar/...ArchiveEntryInfo extend ArchiveEntryInfo and provide archive-type-dependent information

compress.parameters
– Compress/DecompressParameters, similar to CopyParameters, provide information on how to process the archive

compress.result
– Compress/DecompressResult represents the result of an operation

compress.util
– CompressedUtils provides tools to work with URIs inside archives

How it works

1. The CompressFileOperation component calls execute on CompressedFileManager instead of FileManager, because editing the existing classes was not permitted.

2. CompressedFileManager converts the arguments from strings to URIs but does not resolve them; it just passes them to CompressedOperationHandler. Resolving URIs to single files in the handler would be inefficient, because the same compressed file would have to be reopened for every URI inside of it.

3. Instead, the unresolved URI is passed to CompressedOperationHandler, where the required operation is called:

operation(arguments) {
    resolve(arguments.URI);
    foreach (resolved) {
        if (!localFile)
            get resolved.handler and download file
        else
            result = localOperation(file, resolved.innerPath);
    }
    return result;
}

4. The URI is resolved using the resolve() method, which outputs: resolvedURIs – URIs of files which can be accessed by existing handlers; baseHandler – an existing handler with which the resolvedURIs can be accessed; fragments – inner archive paths.

EXAMPLE: URI=zip:(.DATA-IN/file?.zip)#*.txt →
baseHandler=LocalOperationHandler
resolvedURIs=C:\DATA-IN/file1.zip, C:\DATA-IN/file2.zip, ...
fragments=*.txt

5. If the file is local, it is passed on for processing; otherwise it is downloaded to a temporary directory, processed as local and uploaded back if it has been changed.

IMPORTANT: ListCompressedFiles and CompressFiles work only with streams – unlike the others, they do not use the TrueZIP library, only Apache Commons Compress.

localOperation(file, innerPath) {
    if (innerPath.isEmpty()) {
        return file.trueZipOperation();
    } else {
        if (innerPath.isWildcard())
            processWildcards();
        return localOperation(new File(file, innerPath.pollFirst()), innerPath);
    }
}

6. If innerPath is empty, we are already at the file we want to process; otherwise wildcards are processed and we dive one level deeper into the file. Note that wildcards are only permitted as the last innerPath.

com.cloveretl.gui.compress

Definition of the components' GUI

Tests

cloveretl.test.scenarios – graph tests
cloveretl.component.compress – unit tests testing URI processing

Further extension of components

Two situations can happen:

1. The added format is supported by Apache Commons Compress and TrueZIP. In the component.compress.entry package, you have to create a new class which extends the ArchiveEntryInfo class. In component.compress.util you have to edit getScheme(), isCompressed(), getInput/Output(), getNewEntry() and getInfo() by adding the wanted archive type.

2. The added file format is not supported. A new operation handler with its own implementation will have to be created, as the new library will probably not be compatible with Commons Compress and TrueZIP.