Analysis of Big Data Storage Tools for Data Lakes Based on Apache Hadoop Platform
Vladimir Belov, Evgeny Nikulchev
MIREA—Russian Technological University, Moscow, Russia

Abstract—When developing large data processing systems, the question of data storage arises. One of the modern tools for solving this problem is the so-called data lake. Many implementations of data lakes use Apache Hadoop as a basic platform. Hadoop does not have a default data storage format, which leads to the task of choosing a data format when designing a data processing system. To solve this problem, it is necessary to proceed from the results of an assessment according to several criteria. In turn, experimental evaluation does not always give a complete understanding of the possibilities for working with a particular data storage format. In this case, it is necessary to study the features of the format, its internal structure, recommendations for use, etc. The article describes the features of both widely used data storage formats and formats that are currently gaining popularity.

Keywords—Big data formats; data lakes; Apache Hadoop; data warehouses

I. INTRODUCTION

One of the most important tasks of any data processing system is storing the data received. In traditional approaches, the most popular tools for storing data have been relational databases [1], which provide a convenient interface for manipulating data in the form of SQL.

The growth in the volume of data and in the needs of consumers of data processing systems has led to the emergence of the big data concept [2, 3]. The big data concept is based on six aspects: value, volume, velocity, variety, veracity, and variability [4]. This means that big data can be understood not only through volume, but also through its ability to serve as a source for generating valuable information and ideas [5].

New concepts have replaced traditional forms of data storage, among them NoSQL solutions [6] and so-called data lakes [7-9]. A data lake is a scalable system for storing and analyzing data retained in their native format and used for knowledge extraction [6]. A data lake can either be designed from scratch or developed on the basis of existing software solutions [8]. Many implementations of data lakes use Apache Hadoop as a basic platform [9].

For data lakes built on the Apache Hadoop ecosystem, HDFS [10] is used as the basic file system. This file system is cheaper to use than commercial databases. When using such a data warehouse, choosing the right file format is critical [11]. The file format determines how information will be stored in HDFS. It should be taken into account that Apache Hadoop and HDFS do not have any default file format, which has determined the emergence and use of various data storage formats in HDFS.

Among the most widely known formats used in the Hadoop ecosystem are JSON [12], CSV [13], SequenceFile [14], Apache Parquet [15], ORC [16], Apache Avro [17], and PBF [18]. However, this list is not exhaustive. Recently, new data storage formats have been gaining popularity, such as Apache Hudi [19], Apache Iceberg [20], and Delta Lake [21].

Each of these file formats has its own features of file structure. In addition, differences are observed at the level of practical application. Thus, row-oriented formats ensure high writing speed, while column-oriented formats are better for data reading.

A big problem in the performance of platforms for storing and processing data is the time needed to search for and write information, as well as the amount of space the data occupies. Managing the processing and storage of large amounts of information is a complex process.

In this regard, when building big data storage systems, the problem arises of choosing one or another data storage format. To solve this problem, it is necessary to proceed from assessment results according to several criteria. However, testing and experimental evaluation of formats do not always provide a complete understanding of the possibilities for working with a particular data storage format. In this case, it is necessary to study the features of the format, its internal structure, recommendations for use, etc.

The aim of this paper is to analyze the formats used for data storage and processing in data lakes based on the Apache Hadoop platform, their features, and the possibilities of their application to various tasks, such as analytics, streaming, etc. This study is useful when developing a system for processing and storing big data, as it comprehensively explores various tools for storing and processing data in data lakes. In turn, misunderstanding the features of the structure and the recommendations for the use of data storage tools can lead to problems at the maintenance stage of a data processing system.

The article describes both well-known and widely used formats for storing big data, as well as new formats that are gaining popularity now.

The paper is organized as follows. The Background section discusses the main prerequisites for the emergence of data lakes, as well as the features of the file formats used to store data in data lakes built on the basis of the Hadoop platform. The Challenges section explores emerging storage trends for building data lakes.

II. BIG DATA STORAGE FORMATS

Relational databases are the traditional way of storing data [22]. One of the obvious disadvantages of this storage method is the need for a strict data structure [23].

In recent years, the so-called data lakes have gained popularity [7-9]. A data lake is a scalable system for storing and analyzing data retained in their native format and used for knowledge extraction [7]. The prerequisites for this were the following factors:

- Growth in the volume of unstructured data, such as the content of web pages, service logs, etc. For such data, the existence of a common format is not assumed.
- The need to store and analyze large amounts of semi-structured data, such as events from a data bus, dumps from operational databases, etc.
- Development of OLAP technologies and analytics-oriented data storage facilities.
- Development of streaming data processing and transmission.

A data lake can either be designed from scratch or developed on the basis of existing software solutions [8]. Many implementations of data lakes use Apache Hadoop as a basic platform [9]. For such systems, HDFS [10] is used as the basic file system. The traditional way of storing data in HDFS is to create files of various formats.

According to the internal structure of the file, the formats used for working with big data can be divided into the following groups:

- Textual formats: CSV, JSON.
- Hadoop-specific formats: Sequence files.
- Column-oriented formats: Parquet, ORC.
- Row-oriented formats: Avro, PBF.

Each of these groups is focused on solving specific problems. Thus, row-oriented formats ensure high writing speed, while column-oriented formats are better for data reading. Each data storage format is discussed below.

A. Textual Formats

JSON (JavaScript Object Notation) is a textual file format represented as an object consisting of key-value pairs. The format is commonly used in network communication, especially with the rise of REST-based web services [12]. In recent years, JSON has been gaining popularity in document-oriented NoSQL databases [24] such as MongoDB, Couchbase, etc. Many systems support serialization and deserialization of this format by default. This also applies to streaming data processing systems.

JSON supports the following data types:

- primitive: null, boolean, number, string;
- complex: array, object.
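As a brief illustration of these data types, the following minimal Python sketch serializes and deserializes a record whose fields cover all of them; the field names are hypothetical and serve only as an example:

    import json

    # A record covering JSON's primitive types (null, boolean, number,
    # string) and complex types (array, object); field names are
    # illustrative only.
    record = {
        "id": 42,                     # number
        "name": "sensor-1",           # string
        "active": True,               # boolean
        "error": None,                # null
        "readings": [1.5, 2.0, 2.5],  # array
        "meta": {"unit": "celsius"},  # object
    }

    # Serialization: one JSON object per line, a layout commonly used
    # when JSON data is stored in HDFS
    line = json.dumps(record)

    # Deserialization restores an equivalent structure
    assert json.loads(line) == record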
CSV (comma-separated values) is a textual file format presented in the form of a table, the columns of which are separated by a special character (usually a comma). The file may also contain a header with the names of the columns. Despite its limitations, CSV is a popular choice for data exchange because it is supported by a wide range of business, consumer, and scientific applications [13]. In addition, many batch and streaming systems (e.g., Apache Spark [25]) support this format by default.

B. Hadoop-specific Formats

SequenceFile [14] is a binary format for storing data. The file structure is represented as serialized key-value pairs. The peculiarity of this format is that it was specially developed for the Apache Hadoop ecosystem. The structure allows the file to be split into sections during compression, which provides parallelism in data processing.

SequenceFile is a row-oriented format. The file structure consists of a header followed by one or more entries. The header provides technical fields such as the version number, information about whether the file is compressed, and the file's metadata.

There are three different SequenceFile formats depending on the type of compression:

- no compression;
- record compression – each entry is compressed as it is added to the file;
- block compression – compression is performed when the data reaches the block size.
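As a sketch of how such files can be produced and consumed, the following PySpark fragment writes key-value pairs to a SequenceFile and reads them back; the HDFS path is hypothetical, and compression is left at its default (none):

    from pyspark import SparkContext

    sc = SparkContext(appName="sequencefile-sketch")

    # A SequenceFile stores serialized key-value pairs, so the RDD
    # must consist of (key, value) tuples
    pairs = sc.parallelize([(1, "first"), (2, "second"), (3, "third")])

    # Write the pairs as a SequenceFile (hypothetical HDFS path);
    # record or block compression can be enabled through Hadoop
    # output compression properties
    pairs.saveAsSequenceFile("hdfs:///tmp/example_seqfile")

    # Reading restores the key-value pairs
    restored = sc.sequenceFile("hdfs:///tmp/example_seqfile")
    print(restored.collect())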
C. Column-oriented Formats

Apache Parquet [15] is a binary column-oriented data storage format. The format architecture is based on "definition levels" and "repetition levels". An important part of this format is the presence of metadata that stores basic information about the data in a file, which contributes to faster filtering and data aggregation in analysis tasks.

The file structure is represented by several levels of division:

- row group – a horizontal partitioning of the data into groups of rows, which provides faster reading when working in parallel using the MapReduce algorithm.
- column chunk – a block of data for a single column within a row group. This partitioning is intended to speed up work with the hard disk: in this case, data is written not by rows, but by columns.
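To make the effect of this layout concrete, the following PySpark sketch writes a small DataFrame to Parquet and reads back a single column; the path and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

    # Hypothetical columns; in Parquet each column is stored in its
    # own column chunks inside row groups
    df = spark.createDataFrame(
        [(1, "2021-08-01", 10.5), (2, "2021-08-02", 12.1)],
        ["id", "event_date", "value"],
    )

    df.write.mode("overwrite").parquet("hdfs:///tmp/example_parquet")

    # Column pruning and predicate pushdown rely on Parquet metadata,
    # so only the required column chunks are read from disk
    spark.read.parquet("hdfs:///tmp/example_parquet") \
        .select("value").where("id = 1").show()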