
MASARYK UNIVERSITY

FACULTY OF INFORMATICS


File Compression and Decompression in CloverETL

BACHELOR’S THESIS

Sebastián Lazoň

Brno, spring 2014

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Sebastián Lazoň

Advisor: doc. RNDr. Tomáš Pitner, Ph.D.

Acknowledgement

I would like to express my thanks to Javlin’s employees, especially Mgr. Jan Sedláček, for their time, assistance and feedback on problems throughout the development of the project. I would also like to thank doc. RNDr. Tomáš Pitner, Ph.D. for valuable advice on the text of the thesis.

Abstract

The aim of the thesis was to create a set of components for compression, decompression and manipulation of compressed file archives in CloverETL. The first part of the thesis provides an overview of ETL processes, an introduction to CloverETL, the implemented archive formats and the external libraries used; the second part covers the design, implementation and testing of the developed components.

Keywords

Java, CloverETL, compression, decompression

Contents

1 Introduction
 1.1 Motivation
 1.2 Purpose
 1.3 Structure
2 ETL
 2.1 In general
  2.1.1 Extract
  2.1.2 Transform
  2.1.3 Load
 2.2 CloverETL
  2.2.1 Transformation graph
   2.2.1.1 Components
   2.2.1.2 Edges
   2.2.1.3 Sequences
   2.2.1.4 Lookup tables
3 Data compression
 3.1 In general
  3.1.1 Lossy
  3.1.2 Lossless
   3.1.2.1 ZIP
   3.1.2.2 TAR
   3.1.2.3 GZIP
   3.1.2.4 The DEFLATE algorithm
 3.2 In Java
  3.2.1 The java.util.zip package
  3.2.2 Apache Commons Compress
  3.2.3 TrueZIP
4 Analysis
 4.1 File operation components
  4.1.1 Common attributes of Compressed and File Operation
 4.2 Requirements
  4.2.1 Supported URIs
5 Design
 5.1 Project architecture
  5.1.1 Components
  5.1.2 CompressedFileManager
  5.1.3 CompressedOperationHandler
   5.1.3.1 Resolve
   5.1.3.2 List
   5.1.3.3 Delete
   5.1.3.4 Copy/Move
   5.1.3.5 Compress
   5.1.3.6 Decompress
   5.1.3.7 Other methods
  5.1.4 ArchiveInfo
  5.1.5 CompressedUtil
 5.2 Components attributes
  5.2.1 Input mapping
  5.2.2 Output mapping
  5.2.3 Error mapping
6 Implementation
 6.1 Used external libraries
 6.2 Integration with CloverETL
  6.2.1 Integration with Engine
  6.2.2 Integration with Designer
7 Testing and documentation
 7.1 Graph tests
 7.2 Unit tests
 7.3 Documentation
8 Conclusion
 8.1 Further extension of functionality
 8.2 Summary

1 Introduction

Information is an essential part of every enterprise, whether as a subject of its business or, after analysis, as a source of insight into its functioning and an aid to its management. First, the information stored in enterprise systems has to be extracted and processed. But once we realize that these data can be stored in different repositories, platforms and applications, we find that a specialized tool is needed. ETL, shorthand for extract, transform, load, denotes the tools which provide this functionality. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format. [2]

1.1 Motivation

Javlin’s CloverETL is one of these tools. CloverETL is a group of multi-platform Java-based software products implementing ETL processes. It currently supports reading and writing of compressed data, but it has not been able to access and manipulate the content of a compressed archive. When writing data, it is often more efficient to create the files uncompressed and compress them afterwards.

1.2 Purpose

The purpose of this thesis is to create a set of components for compression, decompression and manipulation of compressed file archives in CloverETL. The components’ interface has to be similar to the existing components of the FileOperation category, which work with uncompressed files. Their future extension with new compression formats should be as simple as possible. The new component category, CompressedFileOperations, consists of these components:

ListCompressedFiles provides content listing of archives

DeleteCompressedFiles removes entries from archives

CopyCompressedFiles copies entries from one archive to another

MoveCompressedFiles moves entries from one archive to another

CompressFiles creates new archive or adds files to existing


DecompressFiles decompresses archive entries to selected location

User and developer documentation are also part of the thesis.

1.3 Structure

The thesis is divided into eight chapters. The second is dedicated to an introduction to the field of ETL tools, a presentation of CloverETL and an explanation of the basic principles of how it works. The third discusses compression methods and algorithms in general, compression support in Java and the functionality provided by the external libraries. The following parts describe the analysis and design of the components, their implementation, testing and documentation. In the conclusion, options for further development of the created components’ functionality are presented.

2 ETL

2.1 In general

The term ETL is an essential part of data warehousing¹ and denotes the process of extracting data from a data source, transforming the data to fit operational needs and loading them into a target location. After the data are collected from multiple sources (extraction), they are reformatted and cleansed for operational needs (transformation). Most of the numerous extraction and transformation tools also enable loading of the data into the target location, typically a database, a data warehouse or a data mart, where they are analyzed, allowing developers to create applications and support users’ decisions. [2] Besides data warehousing and business intelligence, ETL tools can also be used to move data from one operational system to another. [11]

2.1.1 Extract

The extract step covers the data extraction from the source system and makes the data accessible for further processing. Usually, data are retrieved from different source systems. These systems may use a different data organization or format, so the extraction must convert the data into a format suitable for transformation processing. [11] This process should use as few resources as possible; it should be designed in a way that does not negatively affect the source system in terms of performance, response time or any kind of locking. [12]

2.1.2 Transform

The transform stage of an ETL process involves the application of a series of rules or functions to the extracted data. It includes the validation of records and their rejection if they are not acceptable, as well as the integration part. While some data sources require very little or even no manipulation of the data, others may require one or more transformations to meet the business and technical requirements of the target database. These transformations can include:

• conversion

• clearing of the duplicates

1. a database used for reporting and data analysis


• standardizing

• filtering and sorting

• translating

• looking up or verifying if the data sources are inconsistent

A good ETL tool must enable building up complex processes and extending its tool library so that custom user functions can be added. [11, 12]

2.1.3 Load

Loading is the last stage of the ETL process; it loads the extracted and transformed data into a target repository. Specialized proprietary technologies for effective and optimal data storage are often used. [11]

2.2 CloverETL

CloverETL is a family of multi-platform software products implementing ETL processes, created in Java. It consists of these products: [7]

CloverETL Engine is the base member of the family. It is a run-time layer that executes transformation graphs created in CloverETL Designer. The Engine is a stand-alone Java library which can be embedded into other Java applications.

CloverETL Designer is a powerful Java-based standalone application for data extraction, transformation and loading, built upon the extensible Eclipse platform. It allows users to create ETL transformations in a user-friendly way, either locally or remotely on a server via CloverETL Server.

CloverETL Server is fully integrated with the Designer and allows running ETL processes in a server environment, where scheduling, parallel execution of graphs and load balancing can be achieved.

2.2.1 Transformation graph

A transformation graph is a directed acyclic graph and has to contain at least one node. Nodes represent components and are the most important part of the graph, while the edges connecting them behave as data channels. A few other elements can be found in a transformation graph, such as sequences, database connections and lookup tables.


Each graph is also divided into a number of smaller units called phases. Every graph contains at least one phase, every node belongs to exactly one phase, and during graph execution the phases are executed sequentially.

2.2.1.1 Components

As mentioned before, components are the most important graph elements. Typically, each component executes a single data transformation. Most components have ports through which they can receive data and/or send the processed data out, and most of them work only when edges are connected to these ports. Each edge in a graph connected to some port must have metadata assigned to it. Metadata describe the structure of the data flowing through the edge from one component to another. Each component can also contain various attributes which change the way the component works. The components can be split into several groups according to their scope: [7]

Readers These components are usually the initial elements of a graph. Their job is to read data from data sources and transform them into individual records.

Writers On the contrary, writers are most of the time located at the end of a graph. They format the records from the input port and write them to a target location.

Transformers Transformers transform the structure of the records from the input port and send them to the output port.

Joiners Joiners join a number of input records into a single output record based on a defined key.

Data quality This group of components performs various tasks related to quality of data - determining information about the data, finding and fixing problems etc.

File Operations The group of components designed for file system manipulation.

2.2.1.2 Edges

An edge is a directed flowline between two components. Each edge has associated metadata and is used as a channel for records between these

components. The structure of the records corresponds to the metadata and cannot be changed during transfer. The order of the records is also unaltered.

2.2.1.3 Sequences

Sequences are used for generating sequences of numbers and are primarily used for unique IDs of records. If needed, they can also preserve their state between individual graph executions.

2.2.1.4 Lookup tables

Lookup tables are data structures designed for storing records and looking them up under a defined ID. Each stored record must correspond to the lookup table’s metadata.

3 Data compression

3.1 In general

Data compression, formally source coding, is the process of reducing the size of data by removing redundant information. This is desirable because it reduces the amount of data we have to store, process or transmit. However, this reduction is not free: it requires extra data processing or data loss¹, often both. Reducing redundancy also makes data less reliable and more prone to errors, therefore data integrity is protected by adding check and parity bits. [1] There are two types of compression, lossy and lossless, differing in the amount of lost information.

3.1.1 Lossy

Lossy compression methods minimize the size of data by selectively discarding the least significant data. They are most commonly used for multimedia data - data intended for human interpretation - because human perception is not perfect and will not notice the difference, or can even fill in missing information to some extent. The ultimate goal is to provide the same perception as the original while removing as much data as possible. Lossy compression suffers from generation loss: repeatedly compressing and decompressing a file causes it to progressively lose quality, therefore the uncompressed original has to be used for editing purposes.

3.1.2 Lossless

In situations where the loss of even a single bit is unacceptable, lossless compression methods are used. These methods remove only statistical redundancy, therefore they cannot be as effective as lossy methods, but the compressed data can be reconstructed into an identical copy of the original. Most lossless compression programs do two things in sequence: the first step generates a statistical model for the input data, and the second step uses this model to map the input data to bit sequences in such a way that frequent data produce shorter output than infrequent data. There are two primary ways of producing statistical models: in static modeling, the data is first analyzed

1. In case of lossy data compression

and then the model is constructed, while adaptive modeling creates the model dynamically as the data is compressed.

3.1.2.1 ZIP

ZIP is one of the most commonly used archive formats with support for file compression. Originally created by Philip Katz, it was first implemented in PKWARE, Inc.’s PKZIP utility. It allows the contained files to be compressed by many different methods, primarily the DEFLATE algorithm. Files can also be stored without compression, and since each file is stored separately, each entry can be compressed using a different method. Data encryption is also supported: initially a simple password-based symmetric system, which is known to be seriously flawed; later an AES-based standard was developed by WinZip. Many vendors use other formats such as DES or certificate-based encryption. [1]

As shown in Figure 3.1, a ZIP file is identified by the presence of the central directory record at the end of the file, which holds information about each entry inside the archive. This metadata consists of the entry name along with other data like entry size, compression method, modification time, etc., and an offset pointing inside the ZIP file to where a copy of the entry header (kept for the sake of redundancy) is located, followed by “extra” data fields and the actual entry data. The “extra” data provide space for extensibility of the ZIP format, such as the ZIP64 extension, which allows ZIP to handle files bigger than 4 GB, AES encryption, file attributes, etc. This internal structure allows us to access each entry individually, without decompressing the whole archive. [1][6]


Figure 3.1: ZIP archive structure
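The per-entry access that the central directory makes possible can be sketched with the standard java.util.zip API (the file and entry names below are arbitrary examples, not taken from the project):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipRandomAccess {

    // Write a small two-entry archive to a temporary file.
    public static File createArchive() throws Exception {
        File f = File.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f))) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return f;
    }

    // ZipFile locates the entry via the central directory at the end of the
    // file, so only the requested entry is decompressed.
    public static String readEntry(File archive, String name) throws Exception {
        try (ZipFile zf = new ZipFile(archive)) {
            ZipEntry e = zf.getEntry(name);
            return new String(zf.getInputStream(e).readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        File f = createArchive();
        System.out.println(readEntry(f, "b.txt")); // beta
        f.delete();
    }
}
```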


3.1.2.2 TAR

TAR is an archive format created in the early days of Unix, originally developed for magnetic tape, where data are accessed sequentially. It therefore lacks a centralized location for the content of the archive (unlike ZIP’s central directory) and does not support random access to its entries. Although TAR does not support data compression, it is often compressed by a stand-alone compression format like GZIP, BZIP2, XZ, etc. The compressed file then gets its name by appending the compressor’s format-specific suffix to the original name.² A TAR archive (Figure 3.2) consists of a series of file entries terminated by an end-of-archive entry, which consists of two 512-byte blocks of zero bytes. Each file entry consists of a header, containing the entry name, statistics and a checksum, and the content of the file. [10]


Figure 3.2: TAR archive structure
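The sequential layout can be made concrete with a minimal, illustrative ustar-style writer and reader in plain Java (the field offsets follow the POSIX header layout; this sketch skips checksum verification and many real-world details):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TarSketch {

    // Write an octal ASCII number into a fixed-width, NUL-terminated field.
    static void writeOctal(byte[] buf, int off, int len, long val) {
        String s = Long.toOctalString(val);
        StringBuilder padded = new StringBuilder();
        for (int i = s.length(); i < len - 1; i++) padded.append('0');
        padded.append(s);
        byte[] b = padded.toString().getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, buf, off, b.length);
    }

    // One file entry: a 512-byte header followed by the content padded to 512 bytes.
    public static byte[] entry(String name, byte[] content) {
        byte[] header = new byte[512];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, header, 0, n.length);  // name, offset 0
        writeOctal(header, 100, 8, 0644);             // mode
        writeOctal(header, 108, 8, 0);                // uid
        writeOctal(header, 116, 8, 0);                // gid
        writeOctal(header, 124, 12, content.length);  // size, offset 124
        writeOctal(header, 136, 12, 0);               // mtime
        header[156] = '0';                            // typeflag: regular file
        Arrays.fill(header, 148, 156, (byte) ' ');    // checksum field counts as spaces
        int sum = 0;
        for (byte b : header) sum += b & 0xff;
        writeOctal(header, 148, 8, sum);              // checksum, offset 148
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, 512);
        out.write(content, 0, content.length);
        int pad = (512 - content.length % 512) % 512;
        out.write(new byte[pad], 0, pad);
        return out.toByteArray();
    }

    // Sequential scan: read each header, then skip over the padded content.
    public static List<String> list(byte[] tar) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (pos + 512 <= tar.length && tar[pos] != 0) { // a zero block ends the archive
            int end = pos;
            while (end < pos + 100 && tar[end] != 0) end++;
            names.add(new String(tar, pos, end - pos, StandardCharsets.US_ASCII));
            long size = Long.parseLong(
                    new String(tar, pos + 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            pos += 512 + (int) ((size + 511) / 512) * 512;
        }
        return names;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        byte[] a = entry("a.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        byte[] b = entry("b.txt", "world!".getBytes(StandardCharsets.US_ASCII));
        archive.write(a, 0, a.length);
        archive.write(b, 0, b.length);
        archive.write(new byte[1024], 0, 1024); // end of archive: two 512-byte zero blocks
        System.out.println(list(archive.toByteArray())); // [a.txt, b.txt]
    }
}
```

Note that listing the archive requires walking every header in order; there is no index to jump to, which is exactly why TAR offers no random access.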

3.1.2.3 GZIP

The GZIP file format is based on the DEFLATE algorithm and is often used in combination with TAR to compress archived data. Although GZIP supports concatenating multiple streams and thus compressing multiple files, most of the time it is used to compress a single file. A GZIP file, as shown in Figure 3.3, consists of:

• a 10-byte header including a magic number³, version number and timestamp,

• optional extra headers with additional information,

• the DEFLATE-compressed original data,

• an 8-byte footer containing a CRC⁴ checksum and the length of the original uncompressed data. [4]

2. For example: file.tar to file.tar.gz
3. A number embedded at or near the beginning of a file that indicates its file format
4. Cyclic redundancy check



Figure 3.3: GZIP file structure
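The header and footer fields described above can be observed directly on the output of java.util.zip.GZIPOutputStream (a minimal sketch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipLayout {

    public static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // The last four footer bytes store the original length modulo 2^32, little-endian.
    public static long originalLength(byte[] gzipped) {
        int n = gzipped.length;
        return (gzipped[n - 4] & 0xffL)
             | (gzipped[n - 3] & 0xffL) << 8
             | (gzipped[n - 2] & 0xffL) << 16
             | (gzipped[n - 1] & 0xffL) << 24;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello gzip".getBytes(StandardCharsets.UTF_8);
        byte[] out = gzip(data);
        // The header starts with the magic number 0x1f 0x8b, then method 8 (DEFLATE).
        System.out.printf("magic: %02x %02x, method: %d%n",
                out[0] & 0xff, out[1] & 0xff, out[2]);
        System.out.println("footer length field: " + originalLength(out));
    }
}
```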

3.1.2.4 The DEFLATE algorithm

DEFLATE is a data compression algorithm designed by Philip Katz, first implemented as a part of the ZIP file format in PKZIP. Since then, it has been used in many applications, including the HTTP protocol, PNG⁵ and PDF⁶. DEFLATE is based on a combination of LZ77 and Huffman coding, so compression is achieved in two stages. In the first stage, if a duplicate series of bytes is spotted, a back-reference is inserted, linking to the previous location of that identical string instead. Relative back-references can be made across any number of blocks, as long as the distance appears within the last 32 kB of uncompressed data decoded⁷. The second stage consists of replacing commonly used symbols with shorter representations and less commonly used symbols with longer representations. The method used is Huffman coding, which creates a prefix-free tree of non-overlapping intervals, where the length of each bit sequence is inversely proportional to the probability of that symbol needing to be encoded: the more likely a symbol is to be encoded, the shorter its bit sequence will be. [1]
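Both stages are exposed in Java through the Deflater and Inflater classes; on repetitive input, the LZ77 back-references alone shrink the data dramatically (a minimal sketch):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {

    public static byte[] deflate(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    public static byte[] inflate(byte[] compressed, int maxLength) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] buf = new byte[maxLength];
        int off = 0;
        while (!inf.finished() && off < buf.length) {
            off += inf.inflate(buf, off, buf.length - off);
        }
        inf.end();
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws Exception {
        // Highly repetitive input: back-references replace the repeats.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("abcdefgh");
        byte[] data = sb.toString().getBytes(StandardCharsets.US_ASCII);

        byte[] packed = deflate(data);
        System.out.println(data.length + " -> " + packed.length + " bytes");
        System.out.println(Arrays.equals(inflate(packed, data.length), data)); // true
    }
}
```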

3.2 In Java

This section provides basic information about data compression support in Java and used external libraries.

3.2.1 The java.util.zip package

Java has provided classes for reading and writing the standard ZIP and GZIP file formats since JDK 1.1. The package also includes classes for compressing and decompressing data using the DEFLATE compression algorithm, which is used

5. Portable Network Graphics
6. Adobe’s Portable Document Format
7. Also called the sliding window

by the ZIP and GZIP file formats. Additionally, there are utility classes for computing the CRC-32 and Adler-32 checksums of arbitrary input streams. [3]
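For instance, the checksum utilities can be exercised with the well-known CRC-32 check value of the ASCII string "123456789" (a minimal sketch):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ChecksumDemo {

    public static long crc32(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    public static long adler32(byte[] data) {
        Adler32 a = new Adler32();
        a.update(data, 0, data.length);
        return a.getValue();
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xCBF43926 is the standard CRC-32 check value for this input.
        System.out.printf("CRC-32:   %08x%n", crc32(check));
        System.out.printf("Adler-32: %08x%n", adler32(check));
    }
}
```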

3.2.2 Apache Commons Compress

The Apache Commons Compress library defines an API for working with ar, cpio, Unix dump, tar, zip, gzip, XZ, Pack200, bzip2, 7z, arj, lzma and Z files. [8] The Compress component is split into compressors, which process single file streams, and archivers, which process archives containing ArchiveEntry instances representing files and directories. Of the required formats, the gzip support is provided by the java.util.zip package, while tar and zip use a custom implementation providing capabilities that go beyond the features of java.util.zip. The Compress component provides factories that can be used to choose an implementation by algorithm name; in the case of an input stream, it can also guess the format and provide the matching implementation.

3.2.3 TrueZIP

TrueZIP is a Java-based virtual file system (VFS) which enables client applications to perform CRUD (Create, Read, Update, Delete) operations on archive files as if they were virtual directories. [9] TrueZIP uses a three-tier architecture:

Access tier Since TrueZIP 7.2, two client API⁸ modules are available which can be used to access virtual file systems. TrueZIP File requires Java SE 6 and provides classes which can be used the same way as the java.io.File* classes. TrueZIP Path requires Java SE 7 because it provides the TFileSystemProvider class to implement a file system provider for the NIO.2 API (JSR 203).

Kernel tier The TrueZIP Kernel module implements virtual file systems, manages their state and commits unsynchronized changes if required or requested. It uses file system drivers to access these resources and provides federating, multithreading, multiplexing, caching and accounting, so that archive file system drivers do not need to take care of these aspects of a virtual file system.

8. Application programming interface


Driver tier TrueZIP supports file systems via its pluggable file system driver architecture. It comes with multiple drivers; specifically, the FILE, ZIP, TAR and TGZ drivers were used in this project. New file system drivers can also be implemented by the user and plugged in.

4 Analysis

This chapter discusses the current group of components, FileOperation, which was used as a template for the implemented components. It also covers the requirements the produced components had to meet.

4.1 File operation components

The FileOperation components allow the user to manipulate the file system’s files and directories. These components support listing of directories, reading file attributes, and creating, copying, moving and deleting files or directories. Files can be accessed either locally or remotely via FTP and HDFS. Limited support for other protocols like HTTP is also included. However, manipulation of the content of archived files is not supported.

4.1.1 Common attributes of Compressed and File Operation

These attributes of File Operation [7] are inherited by Compressed File Operation and have exactly the same function in both groups of components. For the attributes specific to Compressed File Operation, proceed to 5.2.

Input mapping The operation will be executed for each input record. If the input edge is not connected, the operation will be performed exactly once. The attributes of the component may be overridden by the values read from the input port, as specified by the input mapping.

Output mapping Some components contain two output ports, an output port and an optional error port. The output port represents the result record generated by the component.

Error mapping By default, the component will cause the graph to fail if it fails to perform the operation. This can be prevented by connecting the error port. If the error port is connected, the failures will be sent to the error port and the component will continue. The standard output port may also be used for error handling if the Redirect error output option is enabled.

Redirect error output If set to true, errors will be sent to the output port instead of the error port.

Stop processing on fail By default, a failure causes the component to skip all subsequent operations and send the information about the skipped


operations to the error output port. This behavior can be turned off by this attribute.

Verbose output If enabled, one input record may cause multiple records to be sent to the output (e.g. as a result of wildcard expansion). Otherwise, each input record will yield just one cumulative output record.

4.2 Requirements

The inability of FileOperation to handle archive files should be solved by creating a new component group, CompressedFileOperation, which will provide functionality similar to FileOperation but on the entries of archived files. A total of six components will be created: ListCompressedFiles, DeleteCompressedFiles, CopyCompressedFiles and MoveCompressedFiles, whose functionality will be derived from their FileOperation counterparts, and two new components, CompressFiles and DecompressFiles, which will allow adding new files to archives and decompressing them. At first, only the most common archive formats shall be supported: .zip, .tar and .tar.gz, but the possibility of further extension should be kept in mind during design. Developer and user documentation is also part of the requirements and of this thesis. The next section lists examples of the required URIs¹ with descriptions.

4.2.1 Supported URIs

All local and remote URIs supported by FileOperation.

zip:(/path/archive.zip) equals /path/archive.zip when not explicitly said otherwise.

zip:(/path/archive.zip)#entry/file.txt A single entry inside an archive.

zip:(tgz:(/path/archive.tar.gz)#archive.zip)#dir/file.txt A single entry inside a nested archive.

zip:(tgz:(/path/*.tar.gz)#archive.zip)#entry?/file.* Wildcards² are supported, but they may be used only in the outer compressed file, innermost folder and innermost file names. They cannot be used in the inner folder and inner zip file names.

1. Uniform resource identifier
2. A special symbol in an address (? or *) which can be replaced by exactly one character, or by zero or more characters, respectively

5 Design

This chapter describes the structure of the implemented components, explains their integration and shows how the archive data are processed.

5.1 Project architecture

The project consists of two separate parts: the engine plugin, located at cloveretl.component.compress, and the designer plugin, at clover.gui.compress. Figure 5.1 shows the class diagram of the engine plugin, followed by a description of the contained classes.

[Class diagram showing CompressedFileOperationHandler (list, delete, copy, move, compress, decompress), the CompressedUtils utility class, CompressedFileManager, the ArchiveEntryInfo hierarchy with ZipArchiveEntryInfo and TarArchiveEntryInfo, and the Component classes that invoke the manager’s operations.]

Figure 5.1: CompressedFileOperation class diagram

5.1.1 Components

Each of the implemented components has its own definition class. This class has to be derived from the org.jetel.graph.Node class, which describes the basic component operations that have to be implemented in order to run properly. In the case of CompressedOperation and FileOperation, AbstractFileOperation is added in between and takes care of the operations they have in common, like input/output mapping, common parameters, etc.


The structure of a component is inherited from Node and contains these methods:

checkConfig() checks the component’s configuration and makes sure every required attribute is set.

init() creates the input/output/error mapping.

preExecute() prepares the component for the execution of the operation.

executeOperation() calls the operation method on an instance of CompressedFileManager and passes the operation’s parameters and source/target URIs as strings.

processInput/Output/Error();

postExecute() performs the necessary operations after the execution of the operation.

fail() performs the necessary operations, like resource freeing, in case of a fatal error.

5.1.2 CompressedFileManager

CompressedFileManager replaces FileManager, although their functionality is not very different. According to the requirements, alteration of the existing code was not allowed, so a new manager was created. CompressedFileManager transforms strings to CloverURIs¹ and performs basic URI validation. However, these addresses do not always refer to a single file. They can contain wildcards and even more CloverSingleURIs², so the original implementation resolved these addresses, and only SingleCloverURIs with resolved wildcards were passed to the corresponding operation handler. While this approach was completely valid when dealing with ordinary files and directories, archives require a different approach. Let us imagine we want to delete every text file from the root of a ZIP archive. The SingleCloverURI zip:(C:\archive.zip)#*.txt is created in FileManager and resolved to

zip:(C:\archive.zip)#1.txt
zip:(C:\archive.zip)#2.txt
zip:(C:\archive.zip)#3.txt

1. The CloverETL implementation of a URI
2. A CloverURI which contains exactly one physical address


Each URI is passed separately to CompressedOperationHandler for processing, resulting in the archive being opened, searched and altered three times, which is suboptimal. Instead, the CloverURI is passed to CompressedOperationHandler as is, and the resolution takes place there.

5.1.3 CompressedOperationHandler

An OperationHandler is a class which executes a defined operation on different repositories and implements the IOperationHandler interface. One of these classes is CompressedOperationHandler, which resolves and executes operations on archives. There are several methods an OperationHandler has to implement, and CompressedOperationHandler adds two new ones: compress and decompress. Here is their list with a description of their functionality:

5.1.3.1 Resolve

• Resolve receives a SingleCloverURI as a parameter and recursively processes the given URI. In each iteration, if a supported archive URI scheme³ is present, the URI fragment⁴ is added to a list and the rest is passed to the next iteration.

• If the scheme does not indicate an archive, the original FileManager is called to find an operation handler and resolve it.

• The list of resolved URIs, the operation handler and the list of fragments are then returned.

For example, when the resolve method is called with the argument zip:(tgz:(C:\archive?.tar.gz)#archive.zip)#dir/*.txt, the first iteration extracts the URI’s scheme zip, which is a recognized archive scheme, adds dir/*.txt to the fragments and passes on tgz:(C:\archive?.tar.gz)#archive.zip. Similarly, the scheme tgz is recognized, archive.zip is added to the fragments and C:\archive?.tar.gz is passed on.

• When an empty scheme is received in the next step, FileManager finds a suitable handler, in our case LocalOperationHandler, which executes operations on local files and finds the files that satisfy the given URI.

3. The top level of a URI; describes the context in which the rest of the address is interpreted
4. An optional part of a URI; provides a path to a secondary resource, e.g. a section in an article or an entry in an archive
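The recursive peeling described above can be illustrated with a simplified, hypothetical stand-in (this is not the actual CloverETL code; the regular expression and the scheme list are assumptions made for the sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UriPeeling {

    // Matches e.g. "zip:(inner)#fragment": an assumed archive scheme,
    // a parenthesised inner URI and an entry fragment.
    private static final Pattern ARCHIVE =
            Pattern.compile("^(zip|tgz|tar|gz):\\((.*)\\)#(.*)$");

    // Peels archive schemes off the URI, collecting fragments outermost-first;
    // the last element is the remaining physical file path.
    public static List<String> peel(String uri) {
        List<String> fragments = new ArrayList<>();
        Matcher m;
        while ((m = ARCHIVE.matcher(uri)).matches()) {
            fragments.add(m.group(3)); // entry path inside this archive level
            uri = m.group(2);          // descend into the inner URI
        }
        fragments.add(uri);
        return fragments;
    }

    public static void main(String[] args) {
        System.out.println(peel("zip:(tgz:(C:\\archive.tar.gz)#archive.zip)#dir/*.txt"));
    }
}
```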


5.1.3.2 List

• After resolving the URIs, the corresponding resolved handler is asked for an input stream for every resolved URI.

• If successful, the inverse of the resolving process is applied. If the fragments list is not empty, the scheme and fragment are appended to the base URI and passed as a parameter to the next iteration.

• If wildcards are detected in a fragment, the fragment is converted to a regular expression by the wildcardToRegex method and each of the entries is then tested for a match.

• If the fragments list is empty, it means we have reached the required file/directory/archive, and we either return an instance of the ArchiveEntryInfo class or, in the case of a directory or archive, continue processing, depending on the component’s parameters.
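A wildcardToRegex-style conversion might look like the following hypothetical sketch (the real CompressedUtil implementation may differ):

```java
import java.util.regex.Pattern;

public class WildcardSketch {

    // Translate a wildcard pattern into a regular expression:
    // '*' matches zero or more characters, '?' matches exactly one.
    public static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': sb.append(".*"); break;
                case '?': sb.append('.'); break;
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        System.out.println(wildcardToRegex("*.txt").matcher("notes.txt").matches());  // true
        System.out.println(wildcardToRegex("file?.txt").matcher("file1.txt").matches()); // true
    }
}
```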

5.1.3.3 Delete

Delete works similarly to list, but since it alters the archive, the archive has to be accessible locally; if the URI refers to a remote file, it has to be downloaded first.

5.1.3.4 Copy/Move

Copy and move work similarly to delete, but since they work with several archives at a time (at least two: a single source and a target), the process has to be repeated several times.

5.1.3.5 Compress

Compress can be divided into two use cases. If we want to add files to an existing archive, copy is called with adapted parameters. If we want to create a new archive, a separate compress implementation is called, which reads the input files and writes them to a temporary archive; the temporary archive is then moved to the destination defined by the target URI.

5.1.3.6 Decompress

Decompress is implemented as copying from an archive to a regular file or directory.


5.1.3.7 Other methods

getInput, getOutput and getFile return an archive input/output stream or a file, if possible.

5.1.4 ArchiveInfo

The purpose of this abstract class is to provide general information about an entry of an archive. The class has to be extended by an archive-specific child class to also provide information that depends on the entry type, such as ZipArchiveEntry or TarArchiveEntry. The provided information is defined by the Info interface and is described in detail in 5.2.2.

5.1.5 CompressedUtil

CompressedUtil is a utility class which provides functionality for both CompressedOperationHandler and ArchiveInfo. It offers methods for wildcard handling such as isWildcard and toWildcard, operations on URIs such as createNewURI and getScheme, and contains every archive-type-dependent method used in this project.
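Two of these helpers might be shaped roughly like this; only the method names isWildcard and getScheme come from the text, the bodies and the isArchiveScheme helper are hypothetical:

```java
import java.util.Set;

// Hypothetical sketch of CompressedUtil-style helpers; not the thesis code.
public class UtilSketch {

    private static final Set<String> ARCHIVE_SCHEMES = Set.of("zip", "tar", "tgz", "gz");

    /** Returns the leading archive scheme of a URI such as "zip:(...)#x", or null. */
    public static String getScheme(String uri) {
        int i = uri.indexOf(":(");
        return i > 0 ? uri.substring(0, i) : null;
    }

    public static boolean isArchiveScheme(String uri) {
        String s = getScheme(uri);
        return s != null && ARCHIVE_SCHEMES.contains(s);
    }

    /** True if the path contains the ? or * wildcard characters. */
    public static boolean isWildcard(String path) {
        return path.indexOf('*') >= 0 || path.indexOf('?') >= 0;
    }
}
```

Keying on the two-character sequence ":(" rather than ":" alone means a Windows path like C:/archive.zip is not mistaken for a scheme.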

5.2 Components attributes

This section covers the input, result and error attributes of the implemented CompressedFileOperation components, with short descriptions. These attributes are to a large extent derived from the attributes of File Operation. For a description of the common attributes see 4.1.1. A complete component reference can be found in the attached user documentation.

5.2.1 Input mapping

List/DeleteCompressedFiles

File URL – path to the archive file (4.2.1).
Recursive – if set to true, directories are listed/deleted recursively.

Copy/Move/DecompressFiles

Source URL – path to the archive file (4.2.1).
Target URL – Copy/Move: path to a single archive entry; Decompress: path to a single file or directory.


Recursive – if set to true, directories are copied/moved/decompressed recursively.
Overwrite – specifies whether existing files shall be overwritten.
Create parent directories – if set to true, attempts to create nonexistent target directories.

CompressFiles

Source URL – path to the source file or directory.
Target URL – if Create new archive is set to true, path to the new archive; otherwise, path in an existing archive.
Create new archive – if set to true, a new archive is created; otherwise files are added to an existing one.
Archive type – type of the new archive (ignored when Create new archive is false).
Compression level – level of compression (ignored when Create new archive is false).
Recursive – if set to true, directories are compressed recursively.
Overwrite – specifies whether existing files shall be overwritten.
Create parent directories – if set to true, attempts to create nonexistent target directories.

5.2.2 Output mapping

Common attributes

Result – true if the operation has succeeded; can be false if Redirect error output is set.
Error message – used only when Redirect error output is set.
Stack trace – used only when Redirect error output is set.

ListCompressedFiles

URL – URL of the entry inside the archive.
Name – the entry's name.
Can read – true if the entry can be read; if not set, returns true.
Can write – true if the entry can be written; if not set, returns true.


Can execute – true if the entry can be executed; if not set, returns true.
Is directory – true if the entry is a directory.
Is file – true if the entry is a regular file.
Is hidden – true if the entry is a hidden file.
Last modified – the entry's last modified date.
Size – the entry's compressed size.

DeleteCompressedFiles

File URL – URL of the deleted entry in the archive.

Other components

Source URL – URL of the source file/directory/entry.
Target URL – URL of the target file/directory/entry.
Result URL – URL of the newly created file/directory/entry; only when Verbose output is set.

5.2.3 Error mapping

Common attributes

Result – always set to false.
Error message – the error message.
Stack trace – the stack trace of the error.

DeleteCompressedFiles

File URL – URL of the deleted entry in the archive.

Other components

Source URL – URL of the source file/directory/entry.
Target URL – URL of the target file/directory/entry.

6 Implementation

This chapter contains information about the external libraries and how and where they have been used. It also describes the process of integrating the components with CloverETL.

6.1 Used external libraries

This implementation uses the Apache Commons Compress and TrueZIP libraries for processing archived files. While TrueZIP provides an abstraction for manipulating archived files and saves implementation time, Commons Compress has proved to be more efficient in some cases.

ListCompressedFiles uses Commons Compress because it can retrieve entry information from the archive stream, so the file does not have to be decompressed and available locally. CompressFiles also uses this library, because file compression is not a difficult operation to code and it performed slightly better.

The rest of the components use TrueZIP. Since their operations require the file to be accessible locally, reinventing the wheel using Commons Compress would not gain any performance.
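The streaming approach can be illustrated with the JDK's ZipInputStream; the actual components use the analogous ArchiveInputStream hierarchy from Commons Compress, which works the same way for ZIP, TAR and GZIP streams:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch of stream-based listing: entry metadata is read from the stream
// itself, so the archive need not be decompressed to disk or even be local.
public class ListSketch {

    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName()); // metadata only; entry bytes are skipped
            }
        }
        return names;
    }
}
```

Because only the stream is consumed, the same code works for a local file, a remote download in progress, or a nested archive entry.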

6.2 Integration with CloverETL

In order to function properly with CloverETL Designer, several configuration files had to be created. Here is a brief list of these files and their purpose:

6.2.1 Integration with Engine

plugin.xml has to be created in the engine plugin in order to make the components known to ComponentFactory¹. It contains a list of all used external libraries and all components with the names of their definition classes.

In addition to this, every custom component definition class has to extend the org.jetel.graph.Node abstract class.

1. a Java class that creates components based on component type and XML parameter definition


6.2.2 Integration with Designer

components.xml is the component definition file in the designer plugin; its purpose is to define the components' input and output ports, the parameters which configure each component, and other information necessary for CloverETL Designer.

plugin.xml is a configuration file in the designer plugin. It registers the components with CloverETL Designer.

build.xml is the build file.

7 Testing and documentation

7.1 Graph tests

CloverETL uses graph tests to verify the interaction and functionality of components. Each graph test consists of a graph definition, input data and expected output data, which are compared to the actual data produced by the graph execution. Tests are executed automatically during the build to detect errors as soon as possible. Several graph tests were created for each of the implemented components, starting with basic operations on one archive and ending with nested archives and wildcards.

7.2 Unit tests

In addition to the graph tests, some functionality is tested with the JUnit test framework. These tests focus on essential operations like wildcard and URI handling. Every test class has to extend CloverTestCase, which takes care of engine initialization.

7.3 Documentation

Creating user and developer documentation was also part of the requirements. The user documentation is inspired by the existing component reference and describes the components' parameters and output data structure. The supported URI types as defined in 4.2.1 are mentioned too. The developer documentation describes the content of the component.compress package and the functionality of the components, similarly to 5.1. It will be used to prepare the components for production and for further extension of functionality. In addition, the source code of the components and helper classes is documented with JavaDoc and corresponds with the existing CloverETL documentation.

8 Conclusion

8.1 Further extension of functionality

Expandability of the implemented components was one of the key requirements. The set of supported archive formats is likely to be extended in the future, and the process will depend on whether Apache Commons Compress supports the new formats. One of two situations can happen:

Compress supports the new format: a new class extending ArchiveEntry will be created in the component.compress.entry package, and the archive-dependent methods of CompressedUtil in component.compress.util have to be updated.

Compress does not support the new format: in this case, the newly added library will probably not implement the Compress interface, therefore a new operation handler will have to be created. This new handler will then use CompressedOperationHandler, similarly to how this implementation uses the existing operation handlers to access archive files (5.1.3).

8.2 Summary

The main objective of this thesis was to create a set of components for compression, decompression and manipulation of compressed file archives for CloverETL. The components were integrated with both the CloverETL Engine and Designer. The interface of the components is derived from the existing File Operations group of components, and the new features are simple for the user to understand. The code is properly commented; user and developer documentation were also created and are attached.

Bibliography

[1] SALOMON, David. Data Compression: the complete reference. 4th ed. London: Springer, 2007, xxv, 1092 s. ISBN 978-1-84628-602-5.

[2] KIMBALL, Ralph; CASERTA, Joe. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis: Wiley, 2004. ISBN 07-645-7923-1.

[3] Oracle Corporation. Java SE 7 Documentation. [online]. Available: http://docs.oracle.com/javase/7/docs/

[4] Gzip file format specification version 4.3. [online]. Available: http://www.ietf.org/rfc/rfc1952.txt

[5] DEFLATE Compressed Data Format Specification version 1.3. [online] Available: http://www.ietf.org/rfc/rfc1951.txt

[6] PKWARE Inc. APPNOTE.TXT - .ZIP File Format Specification version 6.3.3. [online] Available: http://www.pkware.com/documents/casestudies/APPNOTE.TXT

[7] Javlin a.s. User’s Guide. [online] Available: doc.cloveretl.com/documentation/UserGuide/

[8] The Apache Software Foundation. Commons Compress - Overview. [online] Available: http://commons.apache.org/proper/commons-compress/

[9] Schlichtherle IT Services. TrueZIP Key Features. [online] Available: https://truezip.java.net/features.html

[10] Free Software Foundation, Inc. GNU tar 1.27: Basic Tar Format. [online] Available: http://www.gnu.org/software/tar/manual/html_node/Standard.

[11] Goli Info. ETL – Extract Transform Load. [online] Available: http://www.etltools.org/

[12] Javlin, a. s. Data Integration Info. [online] Available: http://www.dataintegration.info/etl

Attachments

Electronic archive contents

• cloveretl.component.compress – CloverETL engine plugin classes

• cloveretl.gui.compress – CloverETL designer plugin classes

• cloveretl.test.scenarios – graph tests

• documentation – user and developer documentation

Compressed file operation user documentation

A group of components similar to File Operation, differing in the way it works with archived files. Where File Operation handles archives as standard files, Compressed File Operation allows the user to work with them as if they were directories. This group of components supports listing, deleting, copying, moving and decompressing entries of supported archives (currently .zip, .tar and .tar.gz), and adding files to an existing archive or compressing to a new one.
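The "archive as a directory" idea can be illustrated with the JDK's built-in ZIP file system, shown here only as an analogy; the components themselves use TrueZIP, not this API:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Illustration of treating an archive like a directory, via the JDK zip
// file system provider. The class and method names are made up for this sketch.
public class ZipAsDirSketch {

    public static void writeEntry(Path zip, String entry, byte[] data) throws IOException {
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            Path p = fs.getPath(entry);
            if (p.getParent() != null) {
                Files.createDirectories(p.getParent()); // entry "directories"
            }
            Files.write(p, data); // the entry behaves like an ordinary file
        }
    }
}
```

Inside the mounted file system, ordinary Files operations (list, copy, delete) work on archive entries exactly as the components described below let the user do.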

List of components

ListCompressedFiles – provides content listing of archives
DeleteCompressedFiles – removes entries from archives
CopyCompressedFiles – copies entries from one archive to another
MoveCompressedFiles – moves entries from one archive to another
CompressFiles – creates a new archive or adds files to an existing one
DecompressFiles – decompresses archive entries to a selected location

Component reference – input parameters

Common input parameters

Redirect error output (optional) – if enabled, errors will be sent to the output port instead of the error port. Possible values: true | false
Verbose output (optional) – if enabled, one input record may cause multiple records to be sent to the output. Possible values: true | false
Stop processing on fail (optional) – by default, a failure causes the component to skip all subsequent operations and send the information about skipped operations to the error output port. This behaviour can be turned off by this attribute. Possible values: true | false

List/DeleteCompressedFiles

File URL (required) – path to the file or directory inside an archive
Recursive (optional) – list/delete directories recursively. Possible values: true | false

Copy/MoveCompressedFiles/DecompressFiles

Source file URL (required) – path to an entry inside an archive
Target file URL (required) – Copy/Move: path to the file or directory inside an archive; Decompress: path to the file or directory

Recursive (optional) – copy/move/decompress directories recursively. Possible values: true | false
Overwrite (optional) – whether existing files should be overwritten. Possible values: always | update | never
Create parent directories (optional) – creates nonexistent target directories. Possible values: true | false

CompressFiles

Source file URL (required) – path to the source file or directory
Target file URL (required) – path to the destination file or directory inside an archive; when Create new archive is set to true, represents the target archive
Create new archive (optional) – true: create a new archive; false: add files to an existing archive. Possible values: true | false
Archive type (optional) – type of the target archive (relevant only if Create new archive is set). Possible values: ZIP | TAR | TAR.GZ
Compression level (optional) – level of compression (relevant only if Create new archive is set). Possible values: no compression | fastest compression | default compression | best compression
Recursive (optional) – compress directories recursively. Possible values: true | false
Overwrite (optional) – whether existing files should be overwritten. Possible values: always | update | never
Create parent directories (optional) – creates nonexistent target directories. Possible values: true | false
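The four compression-level values map naturally onto the java.util.zip.Deflater constants; the mapping below is a plausible illustration, not necessarily how the component translates them internally:

```java
import java.util.zip.Deflater;

// Illustrative mapping of the component's compression-level values onto
// java.util.zip.Deflater constants; not the thesis code.
public class LevelSketch {

    public static int toDeflaterLevel(String level) {
        switch (level) {
            case "no compression":      return Deflater.NO_COMPRESSION;      // 0
            case "fastest compression": return Deflater.BEST_SPEED;          // 1
            case "best compression":    return Deflater.BEST_COMPRESSION;    // 9
            case "default compression":
            default:                    return Deflater.DEFAULT_COMPRESSION; // -1
        }
    }
}
```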

Component reference – result

Common result entries

result (boolean) – true if the operation has succeeded (can be false if Redirect error output is set)
errorMessage (string) – used only when Redirect error output is set
stackTrace (string) – used only when Redirect error output is set

ListCompressedFiles

URL (string) – URL of the entry in an archive
name (string) – the entry's name
canRead, canWrite, canExecute (boolean) – entry permissions; if not set, return true
isDirectory (boolean) – true if the entry is a directory
isFile (boolean) – true if the entry is a regular file
isHidden (boolean) – true if the entry is hidden
lastModified (date) – the entry's last modified date
size (long) – the entry's compressed size

DeleteCompressedFiles

fileURL (string) – URL of the deleted entry in the archive

Other components

sourceURL (string) – CompressFiles: URL of the source file or directory; other: URL of the source archive entry
targetURL (string) – DecompressFiles: URL of the target file or directory; other: URL of the target archive entry
resultURL (string) – new URL of the processed file/directory/entry

Component reference – error

sourceURL (string) – URL of the source file/directory/entry
targetURL (string) – URL of the target file/directory/entry
result (boolean) – always set to false
errorMessage (string) – the error message
stackTrace (string) – the stack trace of the error

Supported URL formats

• All URLs supported by File Operation
• zip:(/path/archive.zip) – equals /path/archive.zip
• zip:(/path/archive.zip)#entry/file.txt – entry inside an archive
• zip:(tgz:(/path/archive.tar.gz)#archive.zip)#entry/file.txt – entry inside nested archives
• zip:(tgz:(/path/*.tar.gz)#archive.zip)#entry?/file.* – wildcards (? and *) may be used in the outer compressed file names, the innermost folder and the innermost file names; they cannot be used in the inner folder and inner zip file names
• All supported remote URLs

Compressed file operation developer documentation

CompressedFileOperation is a group of components based on FileOperation, focused on working with archive files. It uses the Apache Commons Compress library for listing and compression, and the TrueZIP library for the other file operations. It currently consists of 6 components: ListCompressedFiles, DeleteCompressedFiles, CopyCompressedFiles, MoveCompressedFiles, CompressFiles and DecompressFiles. You can find it in the com.cloveretl.component.compress package.

com.cloveretl.component.compress package content

compress
– component definition classes (extend AbstractFileOperation), mostly identical to their FileOperation counterparts, with added initialization and removal of a temporary directory; CompressFiles and DecompressFiles are derived from CopyCompressedFiles and adjusted to work with their parameters and results
– CompressedFileManager checks the validity of URIs and makes sure CompressFileOperationHandler is called for each operation
– CompressFileOperationHandler (where the actual archive processing takes place) is one of many operation handlers; it implements the IOperationHandler interface but also adds new methods (compress, decompress) which are not present in the interface
– CompressedFileOperationMessages holds custom error messages which are not contained in FileOperationMessages

compress.entry
– ArchiveEntryInfo is an abstract class which implements the Info interface and provides basic information about an archive entry, such as getName() or isDirectory()
– Zip/Tar/...ArchiveEntryInfo extend ArchiveEntryInfo and provide archive-type-dependent information

compress.parameters
– Compress/DecompressParameters, similar to CopyParameters, provide information on how to process the archive

compress.result
– Compress/DecompressResult represents the result of an operation

compress.util
– CompressedUtils provides tools to work with URIs inside archives

How it works

1. The CompressFileOperation component calls execute on CompressedFileManager instead of FileManager, because editing the existing classes was not permitted.

2. CompressedFileManager converts the arguments from strings to URIs but does not resolve them; it just passes them to CompressedOperationHandler. Resolving URIs to single files in the handler would be inefficient, because the same compressed file would have to be reopened for every URI inside of it.

3. Instead, the unresolved URI is passed to CompressedOperationHandler, where the required operation is called:

operation(arguments) {
    resolve(arguments.URI);
    foreach (resolved) {
        if (!localFile)
            get resolved.handler and download file
        else
            result = localOperation(file, resolved.innerPath);
    }
    return result;
}

4. The URI is resolved using the resolve() method, which outputs: resolvedURIs – URIs of files which can be accessed by existing handlers; baseHandler – an existing handler with which the resolvedURIs can be accessed; fragments – inner archive paths.

EXAMPLE: URI=zip:(.DATA-IN/file?.zip)#*.txt →
baseHandler=LocalOperationHandler
resolvedURIs=C:\DATA-IN/file1.zip, C:\DATA-IN/file2.zip, ...
fragments=*.txt

5. If the file is local, it is passed on for processing; otherwise it is downloaded to a temporary directory, processed as local and uploaded back if it has been changed.

IMPORTANT: ListCompressedFiles and CompressFiles work only with streams – unlike the others, they do not use the TrueZIP library, only Apache Commons Compress.

localOperation(file, innerPath) {
    if (innerPath.isEmpty()) {
        return file.trueZipOperation();
    } else {
        if (innerPath.isWildcard())
            processWildcards();
        return localOperation(new File(file, innerPath.pollFirst()), innerPath);
    }
}

6. If innerPath is empty, we are already at the file we want to process; otherwise wildcards are processed and we dive one level deeper into the file. Note that wildcards are only permitted as the last innerPath.

com.cloveretl.gui.compress

Definition of the components' GUI

Tests

cloveretl.test.scenarios – graph tests
cloveretl.component.compress – unit tests testing URI processing

Further extension of components

Two situations can happen:

1. The added format is supported by Apache Commons Compress and TrueZIP. In the component.compress.entry package, you have to create a new class which extends the ArchiveEntryInfo class. In component.compress.util you have to edit getScheme(), isCompressed(), getInput/Output(), getNewEntry() and getInfo() by adding the wanted archive type.

2. The added file format is not supported. A new operation handler with its own implementation will have to be created, as the new library will probably not be compatible with Commons Compress and TrueZIP.