
Big Data Analytics: Analysis of Features and Performance of Big Data Ingestion Tools

Andreea MĂTĂCUȚĂ, Cătălina POPA
The Bucharest University of Economic Studies, Romania
[email protected], [email protected]

The purpose of this study was to analyze the features and performance of some of the most widely used big data ingestion tools. The analysis covers three data ingestion tools developed by Apache: Flume, Kafka and NiFi. The study is based on information about tool functionality and performance, collected from different sources such as articles, books and forums written by people who have actually used these tools. The goal of the study is to compare the big data ingestion tools in order to recommend the tool that best satisfies specific needs. Based on the selected indicators, the results of the study reveal that all three tools consistently assure good results in big data ingestion, but NiFi is the best option from the point of view of functionality, and Kafka from the point of view of performance.
Keywords: Big Data, Data Ingestion, Real-time Processing, Performance, Functionality, Data Ingestion Tools

1. Introduction
During the last years, technology has had a big impact on applications and on the processing of data, and organizations have begun to give more importance to data and to invest more in their collection and management. Big Data has also created a new era and new technologies that allow the analysis of types of data such as text and voice, which exist in huge volumes on the Internet and in other digital structures. The evolution of data is spectacular, and it is worth mentioning that in the past the volume of data was at the level of bytes, whereas nowadays companies use huge volumes of data at the level of petabytes. Experts from the National Climatic Data Center in Asheville estimated that storing all the data that exists in the world would require at least 1,200 exabytes, although it is impossible to pin down a reliable number. These sizes may not mean much to people who have no direct contact with big data, but the volume of data is enormous and it is very difficult to grasp what these numbers mean. V. Mayer-Schönberger and K. Cukier mentioned in [24] that "There is no good way to think about what this size of data means", which shows once more that big data is in continuous evolution and has a bright future.
The paper presents an analysis of the use of big data ingestion and a research study used to evidence the functionality and performance of the most widely used tools. We first introduce some concepts about data ingestion and the importance of choosing it to process big data, and we give a short description of the tools used in the analysis, offering some information about the Hadoop ecosystem. We then review three existing Apache ingestion tools, NiFi, Flume and Kafka, in the processing of big data, and we examine the differences between them and the strong points of each. We want to offer systematic information about the main functionality of the three tools developed by Apache (Flume, Kafka and NiFi) used in the data ingestion process, and a detailed way of combining the tools to improve the results for specific requirements, using in our research different ways to compare the tools based on performance, functionality and complexity. Our analysis shows that all three tools have something special, but there is no single tool that addresses all of a customer's requirements, and the combination of tools is the answer to that problem. We examine and recommend the possible combinations based on the needs of customers.
The paper analyses the main characteristics of data ingestion tools. It provides key information about typical issues of data ingestion and about the reasons why we chose the three Apache ingestion tools instead of others. Using a preliminary view, it is important to identify the common characteristics of the tools; after that, the analysis results are provided. We decided to examine Apache tools because Apache is very well known in the developer community and its web server is the most used web server software, running on 67% of web servers worldwide. According to [3], "The name 'Apache' was chosen from respect for the Native American Indian tribe of Apache, well-known for their superior skills in warfare strategy and their inexhaustible endurance."
The paper has the following structure. Section 2 introduces the concept of data ingestion with big data and its necessity, including a short description of the Hadoop ecosystem and a short paragraph with information about each tool. Section 3 contains our research on the tools, analyzing the main characteristics of NiFi, Flume and Kafka, offering solid arguments for why this paper uses them instead of other tools, and an analysis of their functionality and performance. Section 4 presents the results of our research on the functionality and performance of the tools, with a detailed explanation of each result. Section 5, the conclusion, summarizes the real significance and scope of the information found in this paper and contains our final results.

2. Data ingestion and Hadoop ecosystem
2.1 Data ingestion concept
According to [9], "in typical ingestion scenarios, you have multiple data sources to process. As the number of data sources increases, the processing starts to become complicated". For a long time, data storage did not need additional tools to process the volume of data because the quantity was insignificant, but in recent years, with the appearance of big data, this began to be a problem. As mentioned in the introduction, this paper analyses a new process for obtaining and importing data for storage in a database or for immediate use, called "data ingestion". According to [27], the term "ingestion" means the consumption of a substance by an organism; in our paper the substance is represented by data, and the organism can be, for example, a database where the data are stored.
The data ingestion layer represents the initial step for data coming from different sources, the step where they are categorized and prioritized, but it is important to note that it is close to being the toughest job in the big data process. N. S. Gill [26] mentioned that "Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data". For unfamiliar readers, data ingestion can be explained as moving data (structured or unstructured) from its origin into a system where it can easily be analyzed and stored. In the next paragraphs we present important information about data ingestion from different perspectives: we argue the necessity of data ingestion, we note the challenges met in data ingestion, and we offer information about parameters and key principles.
To complete the process of data ingestion, it is necessary to use a tool capable of supporting the following key principles: network bandwidth, unreliable networks, choosing the right data format, and streaming data.

2.1.1 The necessity of data ingestion
In many situations when using big data, the structure of the data source is not known, and if companies use common data ingestion methods it is difficult to manipulate the data. For companies, data ingestion represents an important strategy, helping them to retain customers and increase profitability.
The main advantages that demonstrate the necessity of data ingestion are the following:
- Increased productivity. It takes companies a lot of time to analyze and move data from different sources, but with data ingestion the process is easier and the time can be used for something else;
- Ingestion of data in batches or in real time. In batches, data are stored based on periodic intervals of time;
- Data are automatically organized and structured, even if they arrive in different big data formats or protocols.


2.1.2 Data ingestion challenges
The variety and volume of data sources are continuously expanding. Extracting data from these sources can be extremely challenging for users, considering the required time and resources. The main issues in data ingestion are the following:
- different formats in the data sources;
- applications and data sources are evolving rapidly;
- data capture and detection are time consuming;
- validation of ingested data;
- data compression and transformation before ingestion.

2.1.3 Data ingestion parameters
The main ingestion parameters used in the comparison are the following:
- Data velocity: this parameter is based on the speed of processing data coming from different sources, such as human interaction, social media and networks;
- Data size: data ingestion works with huge volumes of data, generated from multiple sources, which increases processing time;
- Data frequency: this parameter can have two values, with data processed either in real time or in batches;
- Data format: every company chooses a different format for its data, and data ingestion needs to adapt to every situation.

2.2 Hadoop ecosystem
The Hadoop ecosystem is an essential supporting structure for processing and storing large amounts of data (Big Data). The Apache Hadoop ecosystem grows continuously and consists of multiple projects and tools with valuable features and benefits that provide the capacity for loading, transferring, streaming, indexing, messaging, querying and many others. Hadoop contains two main elements: the Hadoop Distributed Filesystem (HDFS) and MapReduce. HDFS is a file system designed for data storage and processing, made for storing and providing streaming, parallel access to large amounts of data (up to hundreds of TB); HDFS storage is distributed over a cluster of nodes. MapReduce is a large dataset processing model. As the name suggests, it is composed of two steps. The initial step, Map, establishes a process for each single key of the records to be processed (key-value type). The final step, Reduce, performs the operation of summing the results, processing the data output from the map phase according to the required operator or function, resulting in a set of key-value pairs for each single key. A minimal sketch of this model is shown below.
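To make the two steps concrete, the following is a minimal word-count sketch using the Hadoop MapReduce Java API. It is not taken from the paper's test setup; the class names and input/output paths are chosen only for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit a (word, 1) pair for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted by the map phase for each key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```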

Swizec Teller notes in [22] that these two projects can be configured in combination with other projects into a Hadoop cluster. A cluster can have hundreds or thousands of nodes, which can be difficult to configure manually, so a Hadoop cluster creates the need for tools to easily and effectively configure systems and data. HDFS and MapReduce might be executed on separate servers, named Hadoop clients, and security is the main reason for physically separating Hadoop nodes from Hadoop clients: if we decide to install clients on the same servers as Hadoop, we have to provide a high level of security for every user's access, while logically and physically separating them simplifies the needed configuration steps. There are many sub-projects (managed mostly by Apache, or made by free organizations) designed for maintenance and monitoring that integrate very well with Hadoop and let us concentrate on developing data ingestion rather than on monitoring it. Three of the tools commonly used for data ingestion in Hadoop are rigorously described in the following sections.

2.3 Kafka
Apache Kafka is a distributed, high-throughput messaging system, a publish-subscribe environment that provides high availability. With one broker handling hundreds of MB per second of reads and writes from several clients, Kafka is a very fast system. Messages are replicated across the cluster and then stored on disk. According to [7], "Kafka can be used for stream processing, web site activity tracking, metrics collection and monitoring, and log aggregation". "It is a paradox that a key feature of Kafka is its small number of features. It has far fewer features and is much less configurable than Flume", noted Ellen Friedman and Ted Dunning in [8] (Flume is described in the next section of this paper). They also observed that "Kafka is similar to Flume in that it streams messages, but Kafka is designed for a different purpose. While Flume is designed to stream messages to a sink such as HDFS or HBase, Kafka is designed for messages to be consumed by several applications". The principal elements of the Kafka architecture are the Producer, Broker, Consumer and Topic. Topics are used for feeding messages: producers send messages to topics, and consumers subscribe to those topics and consume the messages from them. Topics are partitioned, a key is attached to each message, and a partition is like a log. A minimal producer and consumer are sketched below.
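The following is a minimal sketch of the publish-subscribe flow using the Kafka Java client (the Kafka 2.x client API is assumed; the topic name "events" and the broker address are illustrative, not taken from the paper).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {

  public static void main(String[] args) {
    // Producer: publishes keyed messages to a topic.
    Properties p = new Properties();
    p.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      // The message key decides the partition; each partition behaves like a log.
      producer.send(new ProducerRecord<>("events", "sensor-1", "temperature=21.5"));
    }

    // Consumer: subscribes to the topic and polls for new messages.
    Properties c = new Properties();
    c.put("bootstrap.servers", "localhost:9092");
    c.put("group.id", "ingestion-demo"); // consumers sharing a group split the partitions
    c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(Collections.singletonList("events"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> r : records) {
        System.out.printf("%s -> %s%n", r.key(), r.value());
      }
    }
  }
}
```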
2.4 Flume
"Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data" is the definition found on the official Apache Flume website. This paper analyses the newest version of each tool; the last stable version of Flume is 1.8.0, the eleventh release of the Apache project, which offers users a stable product, compatible with older versions of the Flume 1.x code line, and is software ready for production. The tool is made to ingest and collect huge volumes of data from multiple sources into the Hadoop Distributed File System (HDFS); the most frequently processed types of data are sensor and machine data, social media data, maps, astronomy, aerospace, application logs and geo-location data. One specific example of using Flume is the logging of manufacturing operations: a log is generated on every run of the product as it comes off the line, producing a file with information about the respective run. In a day the product runs thousands of times and generates a large volume of data stored in log files; using Flume, these data can be streamed into an analysis tool and then stored in HDFS. We remark that, in general, Flume allows users to ingest data from multiple sources and of different sizes and store it into Hadoop for future analyses; it scales horizontally to ingest data; the user has the guarantee that the data are delivered, based on the transactions between agents; it can insulate the system when the incoming data rate exceeds the standard rate; and it is more tightly integrated with the Hadoop ecosystem than Kafka or NiFi.
J. Kim and B. Bengfort note in [11] that data flows in Flume along a pathway which ingests data from origin to destination. Data or events are moved from source to destination based on a sequence of hops; the central concept is the Flume agent (a JVM process), which consists of three important components: the source, the channel and the sink. The source represents the part of the agent where data are received from data generators and then transferred to the channels as Flume events. The channel can work with different sources and sinks and represents a bridge between them: it receives the data from the source and buffers them until they are consumed by the sinks. The sink represents the final component of the Flume agent, where the data from the channels are consumed and sent to the destination (the data are stored in HDFS at this step); a sketch of such an agent configuration is given below. We note that the biggest disadvantage of using Flume is that data can be lost very easily; for example, if the user chooses the memory channel for high throughput, the data will be lost when the agent node goes down.
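Flume agents are wired together in a Java properties file. The following is a minimal sketch of the source-channel-sink structure described above; the agent and component names (a1, r1, c1, k1), the port and the HDFS path are illustrative assumptions, not taken from the paper.

```
# One agent (a1) with one source, one memory channel and one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
# (fast, but events are lost if the agent node goes down)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: writes the events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Starting the agent with a command such as `flume-ng agent --conf-file example.conf --name a1` would then stream anything written to port 44444 into HDFS.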

2.5 NiFi
Some systems generate data and other systems consume it; Apache NiFi was developed to automate this flow. In [16] Apache NiFi is defined as "a data flow management system that comes with a web User Interface that helps to build data flows in real time, it supports flow-based programming and the graph programming includes processors and connectors, instead of nodes and edges". The user connects processors together with connectors and defines how the data are to be manipulated. A strong feature is NiFi's capability of ingesting any data, using ingestion methodologies suited to each particular kind of data; the output is very customizable as well. T. John and P. Misra observed in [23] that "The Apache NiFi website states Apache NiFi as - An easy to use, powerful, and reliable system to process and distribute data". We consider it a good alternative to Apache Flume, having a vast set of features and an easy-to-use web user interface. It is easy to set up and it is very highly customizable. Data passing through a flow is represented as FlowFiles handled by processors, as the sketch below illustrates.
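Although NiFi flows are normally assembled in the web UI rather than in code, its processors are Java classes. The following is a minimal sketch of a custom processor built on NiFi's public processor API (the class and relationship names are illustrative). It simply passes every incoming FlowFile to a "success" relationship, which is the skeleton onto which real transformation logic would be added.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class PassThroughProcessor extends AbstractProcessor {

  // Connectors between processors are modeled as named relationships.
  public static final Relationship SUCCESS = new Relationship.Builder()
      .name("success")
      .description("All FlowFiles are routed here")
      .build();

  @Override
  public Set<Relationship> getRelationships() {
    return Collections.singleton(SUCCESS);
  }

  @Override
  public void onTrigger(ProcessContext context, ProcessSession session)
      throws ProcessException {
    // A FlowFile is NiFi's unit of data: content plus attributes.
    FlowFile flowFile = session.get();
    if (flowFile == null) {
      return;
    }
    // A real processor would read or transform the content here.
    session.transfer(flowFile, SUCCESS);
  }
}
```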
3. The research methods
This section contains information about the analysis of the main characteristics of the tools Flume, NiFi and Kafka, and represents our analysis of their functionalities and characteristics, with the aim of justifying their choice for the content of this paper. Our analysis is based on comparing them from different perspectives, such as performance, functionality and necessity. First, when we searched for information about data ingestion tools, we found over 30 different tools used by companies in this process, and we had to choose the best of them for our analysis. The choice of a data ingestion tool for a company depends on multiple factors such as target, transformations (simple or complex), data source, performance and necessity, so we used the same criteria in our research. Next, we searched for articles and opinions that contained key words like "data ingestion used tool" and "first option of data ingestion tool", and we obtained the main tools used for data ingestion. The final decision was based on [18], where, in a top-18 ranking of data ingestion tools, Flume is in second position, followed by Apache Kafka and Apache NiFi, the first option being Amazon Kinesis. Based on this top we decided that our paper would analyze three tools which represent a main choice for companies and users. We preferred to use in our research the Apache tools mentioned in this paper because an advantage of them is the possibility of combining them for a better result. Comparing them with the other tools from the data ingestion area, we noted that the tools analyzed in this paper have some special characteristics, such as the guarantee that they are reliable, offering zero data loss when using large volumes of data; the Apache Kafka and Flume systems are scalable, reliable and high-performance. Our decision was based on criteria which convinced us that, if put in a hypothetical situation of choosing a tool for data ingestion, we would use one of them. The last and perhaps most important criterion which helped us decide was the information from articles and books such as [16], [17] and [21], based on the use of these tools in well-known applications. We find that the main criteria when a company wants to choose a tool for data ingestion are: the speed of ingesting data; platform support, which offers the facility to connect with data stores; the facility to scale the framework to work with large datasets; and the facility to extract and access data from sources without impact on their ability to execute transactions or on performance. In our choice of NiFi, Kafka and Flume we used these criteria.

3.1 Functionalities of analyzed tools
An important part of the study is the analysis of tool functionalities. In this subsection we present the functionalities used for the tool comparison. We analyzed the following indicators: reliability, system requirements, limits of the tool, stream ingest and processing, guaranteed delivery and data type. In the following paragraph we justify our choice of these indicators; in the result section we examine the functionality of each tool using a standard scale with three values: completely implemented functionality, partially implemented functionality and not implemented functionality.


According to [25], "data collection and transportation should be reliable with minimum data loss". In our research, reliability was based on the ability to deliver events and to log them in situations when one of the tools has a software crash or scarce memory or bandwidth, and the possibility of losing data can appear. We consider the guaranteed delivery functionality very important because the user needs the guarantee that the data are complete after the data ingestion process. Limits of the tools were included in our analysis to expose their disadvantages. We want to note that, from this point of view, we consider that a perfect tool does not exist, and for companies a better solution can be to combine them. Another important functionality used in our investigation was the data type, because it is important for the user to know what types of data can be ingested with the chosen tool.
3.2 Performance measurements of the selected tools
Performance testing with big data includes two main actions: data ingestion and throughput, where it is verified how fast the system can consume data from various data sources, and data processing, where the speed with which map or reduce jobs execute queries is verified. We included performance measurements in our research because we consider that the user needs to know information about speed, the number of processed files per second, and the acceptable file size for each tool. In our analysis we relied on performance measurement studies published on the tools' official pages and on opinions found in articles from different users of Kafka, NiFi and Flume. In this paper we used the following performance indicators in our research: speed, number of files processed per second, scalability and message durability.
For businesses it can be challenging to ingest big data at a reasonable speed and to process it efficiently in order to maintain a competitive advantage, so the speed indicator needed to be included in our analysis. Another indicator included in our research was the number of files processed per second; we want to note that for our tools the number of files processed per second differed because of the size of the files. Scalability was included in our research based on what Cory Isaacson said in [5]: "When managing a successful expanding application, the ability to scale becomes a critical need. Whether you are introducing the latest new game, a highly popular mobile application, or an online analytics service, it is important to be able to accommodate rapid growth in traffic and data volume to keep your users happy." We consider this indicator an important one because data ingestion tools work with huge volumes of big data, and scalability can help in this process. Message durability was included in our research to make sure that even if the tool dies, the task is not lost. When one of the tools quits or crashes, the information about queues and messages is forgotten; to make sure that messages are not lost, it is necessary to mark both the messages and the queue as durable. In the result section we examine the performance indicators for each tool using a standard scale which can have three values: high, medium and low.
Below, we present a performance measurement study for Flume, published on Flume's official page; we used it in our research because it is the most explicit study found on this subject evidencing Flume's performance in the big data ingestion process. The test was made by M. Percy in [15], who used the following setup: the Flume agent was run in a single JVM on its own physical machine, and a separate client machine was used to generate load in syslog format against the Flume box. Data was stored by Flume onto a 9-node HDFS cluster configured on separate hardware; no virtual machines were used in this test. Regarding hardware specifications, the CPU was an Intel Xeon L5630, 2 x quad-core with Hyper-Threading @ 2133 MHz (8 physical cores), with 48 GB of memory and SuSE Linux 64-bit as the operating system. For the Flume configuration, Mike Percy used Java 1.6.0_26, one agent, the memory channel, and the HDFSEventSink sink with Avro event serialization and snappy compression; the event size used was 300 bytes. According to the obtained results, Flume is capable of assuring an approximate average performance of 70,000 events/second (about 21 MB/s at 300 bytes per event) on a single machine, without data loss during the test.
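The setup described above can be approximated with a Flume properties file like the following sketch; the agent and component names, port, capacities and HDFS path are illustrative assumptions, and Percy's exact test configuration is linked in [15].

```
perf.sources = r1
perf.channels = c1
perf.sinks = k1

# Syslog TCP source, matching the syslog-format load generator
perf.sources.r1.type = syslogtcp
perf.sources.r1.host = 0.0.0.0
perf.sources.r1.port = 5140
perf.sources.r1.channels = c1

# Memory channel, as in the test
perf.channels.c1.type = memory
perf.channels.c1.capacity = 100000
perf.channels.c1.transactionCapacity = 1000

# HDFS sink with Avro event serialization and snappy compression
perf.sinks.k1.type = hdfs
perf.sinks.k1.channel = c1
perf.sinks.k1.hdfs.path = hdfs://namenode/flume/perf
perf.sinks.k1.hdfs.fileType = CompressedStream
perf.sinks.k1.hdfs.codeC = snappy
perf.sinks.k1.serializer = avro_event
```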

4. The research results
4.1 The functionality comparison results
Based on the results obtained in our research, we compare the tools from the functionality point of view for a better analysis (see Table 1). According to the Wikipedia definition, Apache Flume "is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data"; in our research we found this functionality completely implemented, and Flume has a failover mechanism that can move the data flows to a new agent without external intervention. We consider that the best choice for this functionality is Kafka, because in the situation of a single point of failure the data remain available, in contrast with Flume, where the user cannot access events until the disk is recovered. On the other hand, NiFi is reliable by definition, but in the real world it is better to combine this tool with Kafka in order to use Kafka's reliable data stream storage. Flume and NiFi represent the main choices for guaranteed data delivery in comparison with Kafka, where delivery is not guaranteed in totality. Depending on the protocol, NiFi allows guaranteed delivery, with the mention that it supports at most once or at least once semantics, and Flume guarantees the delivery of events using a transactional approach.
All of them have a limitation for the "limits of the tool" functionality, because no single tool exists that addresses all requirements or does everything. In our analysis we obtained a list of limits for every tool. For Flume we found that when the Kafka channel is used the possibility of data loss appears, data size is limited to the KB range, and data replication does not exist. Kafka has the same problem as Flume with the data size (KB); it has a fixed protocol, format and schema, so custom code is often needed. The last one, NiFi, does not support data replication and is not an option for CEP or windowed computations. The data types for Flume are represented by the file formats SequenceFile, DataStream and CompressedStream; Kafka accepts data types like JSON, POJO or Java beans and, the fastest way, byte arrays; NiFi uses its own data object (the FlowFile). The conclusion for this functionality is that a tool that can process all types of data does not exist, and the user's decision depends on his needs. We consider that NiFi is the best option from the point of view of system requirements, because it can run on a laptop or be clustered across enterprise-class servers; the hardware and memory needed depend on the size of the data; it can run on Windows, Linux, Unix or Mac OS; it supports all types of browsers; and it requires Java 8 or newer. On the other hand, Flume can run only on Linux, requires Java 8 or newer, requires sufficient disk space for sinks and channels, and the agents need read/write permissions on directories. Kafka requires machines with a lot of memory, can run on Unix or Windows, and requires Java 8 or newer. Using Hadoop, Flume can be used to transfer, collect and aggregate streaming events because it is a distributed system, while its simple, flexible and intuitive programming model is based on streaming data flows; our analysis shows that Flume maintains a main list of ongoing data flows. In Kafka, messages are put into topics, which are split into partitions, and these are replicated across the nodes in the cluster; Kafka provides high-throughput persistent messaging that is scalable and allows parallel data loads into Hadoop. NiFi provides real-time control and, when run in a cluster, makes the movement of data between source and destination easier to manage. In conclusion, from the point of view of functionality, according to our analysis we consider that NiFi represents the best solution to use in a company.


Table 1. Functionality tools comparison

| Functionality                | Flume                  | NiFi                   | Kafka                  | Recommended tool       |
|------------------------------|------------------------|------------------------|------------------------|------------------------|
| Reliability                  | Partially implemented  | Partially implemented  | Completely implemented | Kafka                  |
| Guaranteed delivery          | Completely implemented | Completely implemented | Partially implemented  | Flume and NiFi         |
| Data type                    | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka  |
| System requirements          | Partially implemented  | Completely implemented | Partially implemented  | NiFi                   |
| Stream ingest and processing | Partially implemented  | Completely implemented | Partially implemented  | Flume, NiFi and Kafka  |
| Limits of the tool           | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka  |

4.2 The performance comparison results
In Table 2 we summarize the performance indicators for Kafka, Flume and NiFi based on the results of our analysis. To determine the measures for the tools in our research, we relied on the functionality results, where we noted that Flume and Kafka have a limit on data size (KB); from this point of view, Flume and Kafka are the best options for the number of files processed per second, because the size of the files is small. The speed of the tools was set to medium for all of them, because this indicator does not have a standard limit; it depends on the needs of the user, and if the user wants to increase it, a tuning procedure can be used. According to [14], "Kafka is a general purpose publish-subscribe model messaging system, which offers strong durability, scalability and fault-tolerance support"; we consider this tool the best option for scalability in comparison with NiFi and Flume, which are distributed, reliable and available systems. Kafka is very scalable, and one of its key benefits is that adding a large number of consumers can be done easily, without downtime and without affecting performance, in comparison with Flume and NiFi, where this process cannot be done as easily (a consumer-group sketch is given below). For the last indicator we obtained that Flume represents one of the best choices because it supports the chaining of multiple interceptors and data flow models; with them, Flume makes event transforming and filtering very easy. Kafka supports synchronous and asynchronous replication, based on the durability requirement, and uses commodity hard drives. On the other hand, NiFi does not have native support for message processing, and in this case the tool needs to integrate with other event processing frameworks to complete the job. In conclusion, our choice from the point of view of the performance indicators is Kafka, because it obtained good results for processing, message durability and scalability.
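To illustrate why consumers can be added without downtime, the sketch below (assuming the same Kafka Java client and the illustrative topic "events" used earlier) starts an additional consumer in the same consumer group; Kafka then rebalances the topic's partitions among the group members automatically, with no broker reconfiguration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ScaleOutConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    // Same group.id as the already-running consumers: joining this group
    // triggers a rebalance that spreads the partitions over all members.
    props.put("group.id", "ingestion-demo");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events"));
      while (true) {
        // Each group member now polls only its assigned share of the partitions.
        consumer.poll(Duration.ofSeconds(1))
                .forEach(r -> System.out.println(r.partition() + ": " + r.value()));
      }
    }
  }
}
```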


Table 2. Performance indicators

| Performance indicator                | Flume  | NiFi   | Kafka  | The best choice  |
|--------------------------------------|--------|--------|--------|------------------|
| Speed                                | Medium | Medium | Medium | All of them      |
| Number of files processed per second | High   | Medium | High   | Flume and Kafka  |
| Scalability                          | Medium | Medium | High   | Kafka            |
| Message durability                   | High   | Low    | High   | Flume and Kafka  |

5. Conclusions
The complexity and volume of data generated by human and machine activity is increasing continuously. This paper presented an analysis of the use of big data ingestion and of the process of ingesting the variety, volume and veracity of Big Data.
After introducing the concept of data ingestion with Big Data and its necessity, and giving a short description of the Hadoop ecosystem and of three of the most widely used tools for big data ingestion, we examined the three Apache tools NiFi, Flume and Kafka in order to determine their common characteristics and to analyze the strong points of each one according to performance and functionality. This research showed that in terms of performance Kafka offers the best results, and in terms of functionality NiFi is the best option.

References
[1] A. Holmes, Hadoop in Practice, Second Edition, Manning, 2014
[2] M. Ali-ud-din Khan, M. Fahim Uddin, N. Gupta, "Seven V's of Big Data," in Proc. of Zone 1 Conference of the American Society for Engineering Education, 2014
[3] Apache Hadoop (2018). What Is Apache Hadoop? Retrieved from http://hadoop.apache.org/
[4] B. Bengfort, J. Kim, Data Analytics with Hadoop: An Introduction for Data Scientists, O'Reilly Media, 2016
[5] C. Isaacson, Understanding Big Data Scalability: Big Data Scalability Series, Part I, p. 30, Prentice Hall, 2014
[6] D. Zburivsky, Hadoop Cluster Deployment, 2013
[7] D. Vohra, Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools, Apress, 2016
[8] E. Friedman, T. Dunning, Real-World Hadoop, O'Reilly Media, 2015
[9] H. Shah, N. Sawant, Big Data Application Architecture Q&A: A Problem-Solution Approach, Apress, 2013
[10] J. Kreps, Putting Apache Kafka to Use: A Practical Guide to Building a Streaming Platform. Retrieved from https://www.confluent.io/blog/stream-data-platform-1/, 2015
[11] J. Kim, B. Bengfort, Data Analytics with Hadoop, O'Reilly Media, Inc., 2016
[12] J. Shan, Why Big Data is a big deal, 2014
[13] K. Noyes, "How Apache Kafka is greasing the wheels for big data," Computerworld, Nov. 1, 2015
[14] L. Jiang, Flume or Kafka for Real-Time Event Processing. Retrieved from https://www.linkedin.com/pulse/flume-kafka-real-time-event-processing-lan-jiang, 2015
[15] M. Percy, Flume NG Performance Measurements. Retrieved from https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements, 2012
[16] N. Kumar, P. Shindgikar, Modern Big Data Processing with Hadoop, Packt Publishing, 2018
[17] N. Garg, Learning Apache Kafka, Second Edition, Packt Publishing, 2015


[18] Predictive Analytics Today (2017). Top 18 data ingestion tools. Retrieved from https://www.predictiveanalyticstoday.com/data-ingestion-tools/
[19] R. Manivannan, Using Kylo for Self-Service Data Ingestion, Cleansing, and Validation, 2017
[20] S. Ornes, Explainer: Understanding size data. Retrieved from https://www.sciencenewsforstudents.org/article/explainer-understanding-size-data, 2013
[21] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Second Edition, p. 56-70, Packt Publishing, 2015
[22] S. Teller, Hadoop Essentials, Packt Publishing, 2015
[23] T. John, P. Misra, Data Lake for Enterprises, Packt Publishing, 2017
[24] V. Mayer-Schönberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Eamon Dolan/Mariner Books, 272 p., ISBN: 978-0544227750, 2014
[25] Y. Yang, Data ingestion overview. Retrieved from https://medium.com/@Pinterest_Engineering/scalable-and-reliable-data-ingestion-at-pinterest-b921c2ee8754, 2017
[26] N. S. Gill, Data Ingestion, Processing and Architecture layers for Big Data and IoT. Retrieved from https://www.xenonstack.com/blog/big-data-engineering/ingestion-processing-big-data-iot-stream/, 2017
[27] Ingestion – Wikipedia, https://en.wikipedia.org/wiki/Ingestion

Andreea MĂTĂCUȚĂ graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Bucharest University of Economic Studies in 2016. She completed the master's programme in Economic Informatics at the same faculty and received her master's diploma this year (2018). She currently works full-time as a Java programmer at Axway, having previously worked in C#. Her main focus is to learn more about programming and to become better in her domain. In her personal life she is passionate about technology, reading, pets and sport.

Cătălina POPA graduated from the Faculty of Electronics, Telecommunications and Information Technology of the Polytechnic University of Bucharest in 2016. In June 2018 she completed the master's programme in Economic Informatics at the Faculty of Economic Cybernetics, Statistics and Informatics of the Academy of Economic Studies. She is currently a Business Intelligence Application Administrator at Orange, in the IT Services Operations Department. Her work focuses on improving monitoring systems by adding and modifying Shell and PL/SQL scripts.
