
Big Data Analytics: Analysis of Features and Performance of Big Data Ingestion Tools

Andreea MĂTĂCUȚĂ, Cătălina POPA
The Bucharest University of Economic Studies, Romania
[email protected], [email protected]

The purpose of this study was to analyze the features and performance of some of the most widely used big data ingestion tools. The analysis covers three data ingestion tools developed by Apache: Flume, Kafka and NiFi. The study is based on information about tool functionality and performance, collected from different sources such as articles, books and forums written by people who have actually used these tools. The goal of the study is to compare the big data ingestion tools in order to recommend the tool that best satisfies specific needs. Based on the selected indicators, the results of the study reveal that all three tools consistently assure good results in big data ingestion, but NiFi is the best option from the point of view of functionality, and Kafka from the point of view of performance.
Keywords: Big Data, Data Ingestion, Real-time Processing, Performance, Functionality, Data Ingestion Tools

1. Introduction
During the last years, technology has had a big impact on applications and on the processing of data, and organizations have begun to give more importance to data and to invest more in their collection and management. Big Data has also created a new era and new technologies that allow the analysis of types of data such as text and voice, which exist in huge volumes on the Internet and in other digital structures. The evolution of data is spectacular, and it is worth mentioning that in the past the volume of data was at the level of bytes, whereas nowadays companies use huge volumes of data at the level of petabytes. Experts from the National Climatic Data Center in Asheville estimated that storing all the data that exists in the world would require at least 1,200 exabytes, although it is impossible to pin down a reliable number. These sizes may not mean much to people who have no direct contact with big data, but the volume of data is enormous and it is very difficult to grasp what these numbers mean. V. Mayer-Schönberger and K. Cukier mentioned in [24] that "There is no good way to think about what this size of data means", which shows once more that big data is in continuous evolution and has a bright future.
The paper presents an analysis of the use of big data ingestion and a research study used to evidence the functionality and performance of the most widely used tools. We first introduce some concepts about data ingestion and the importance of choosing it to process big data, and we give a short description of the tools used in the analysis, offering some information about the Hadoop ecosystem. We then review three existing Apache ingestion tools, NiFi, Flume and Kafka, in the processing of big data, and we examine the differences between them and the strong points of each. We want to offer systematic information about the main functionality of the three tools developed by Apache (Flume, Kafka and NiFi) used in the data ingestion process, and a detailed way of combining the tools to improve the results for specific requirements, using in our research different ways to compare the tools based on performance, functionality and complexity. Our analysis shows that all three tools have something special, but there is no single tool that addresses all of a customer's requirements, and the combination of tools is the answer to that problem. We examine and recommend the possible combinations based on the needs of customers.
The paper analyses the main characteristics of data ingestion tools. It provides key information about typical issues of data ingestion and about the reasons why we chose the three Apache ingestion tools instead of others. Using a preliminary view, it is important to identify the common characteristics of the tools; after that, the analysis results are provided. We decided to examine Apache tools because Apache is very well known in the developer community and its web server is the most used web server software, running on 67% of web servers worldwide. According to [3], "The name 'Apache' was chosen from respect for the Native American Indian tribe of Apache, well-known for their superior skills in warfare strategy and their inexhaustible endurance."
The paper has the following structure. Section 2 introduces the concept of data ingestion with big data and its necessity, including a short description of the Hadoop ecosystem and a short paragraph with information about each tool. Section 3 contains our research on the tools, analyzing the main characteristics of NiFi, Flume and Kafka, offering solid arguments for why this paper uses them instead of other tools, and an analysis of their functionality and performance. Section 4 presents the results of our research on the functionality and performance of the tools, with a detailed explanation of each result. Section 5, the conclusion, summarizes the real significance and scope of the information found in this paper and contains our final results.

2. Data ingestion and Hadoop ecosystem
2.1 Data ingestion concept
According to [9], "in typical ingestion scenarios, you have multiple data sources to process. As the number of data sources increases, the processing starts to become complicated". For a long time, data storage did not need additional tools to process the volume of data because the quantity was insignificant, but in recent years, with the appearance of big data, this began to be a problem. As mentioned in the introduction, this paper analyses a new process for obtaining and importing data for storage in a database or for immediate use, called "data ingestion". According to [27], the term "ingestion" means the consumption of a substance by an organism; in our paper the substance is represented by data, and the organism can be, for example, a database where the data are stored.
The data ingestion layer represents the initial step for data coming from different sources, the step where they are categorized and prioritized, but it is important to note that it is close to being the toughest job in the big data process. N. S. Gill [26] mentioned that "Big Data Ingestion involves connecting to various data sources, extracting the data, and detecting the changed data". For unfamiliar readers, data ingestion can be explained as moving data (structured or unstructured) from its origin into a system where it can easily be analyzed and stored. In the next paragraphs we present important information about data ingestion from different perspectives: we argue the necessity of data ingestion, we note the challenges met in data ingestion, and we offer information about parameters and key principles.
To complete the process of data ingestion, it is necessary to use a tool capable of supporting the following key principles: network bandwidth, unreliable networks, choosing the right data format, and streaming data.

2.1.1 The necessity of data ingestion
In many situations when using big data, the structure of the data source is not known, and if companies use common data ingestion methods it is difficult to manipulate the data. For companies, data ingestion represents an important strategy, helping them to retain customers and increase profitability.
The main advantages that demonstrate the necessity of data ingestion are the following:
- Increased productivity. It takes companies a lot of time to analyze and move data from different sources, but with data ingestion the process is easier and the time can be used for something else;
- Ingestion of data in batches or in real time. In batches, data are stored based on periodic intervals of time;
- Data are automatically organized and structured, even if they arrive in different big data formats or protocols.


2.1.2 Data ingestion challenges
The variety and volume of data sources are continuously expanding. Extracting data from these sources can be extremely challenging for users, considering the required time and resources. The main issues in data ingestion are the following:
- different formats in the data sources;
- applications and data sources are evolving rapidly;
- data capture and detection are time consuming;
- validation of ingested data;
- data compression and transformation before ingestion.

2.1.3 Data ingestion parameters
The main ingestion parameters used in the comparison are the following:
- Data velocity: this parameter is based on the speed of processing data coming from different sources, such as human interaction, social media and networks;
- Data size: data ingestion works with huge volumes of data, generated from multiple sources, which increases processing time;
- Data frequency: this parameter can have two values, with data processed either in real time or in batches;
- Data format: every company chooses a different format for its data, and data ingestion needs to adapt to every situation.

2.2 Hadoop ecosystem
The Hadoop ecosystem is an essential supporting structure for processing and storing large amounts of data (Big Data). The Apache Hadoop ecosystem grows continuously and consists of multiple projects and tools with valuable features and benefits that provide the capacity for loading, transferring, streaming, indexing, messaging, querying and many others. Hadoop contains two main elements: the Hadoop Distributed Filesystem (HDFS) and MapReduce. HDFS is a file system designed for data storage and processing, made for storing and providing streaming, parallel access to large amounts of data (up to hundreds of TB); HDFS storage is distributed over a cluster of nodes. MapReduce is a large dataset processing model. As the name suggests, it is composed of two steps. The initial step, Map, establishes a process for each single key of the records to be processed (key-value type). The final step, Reduce, performs the operation of summing the results, processing the data output from the map phase according to the required operator or function, resulting in a set of key-value pairs for each single key. A minimal sketch of this model is shown below.
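To make the two steps concrete, the following is a minimal word-count sketch using the Hadoop MapReduce Java API. It is not taken from the paper's test setup; the class names and input/output paths are chosen only for illustration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit a (word, 1) pair for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted by the map phase for each key.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```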

Swizec Teller notes in [22] that these two projects can be configured in combination with other projects into a Hadoop cluster. A cluster can have hundreds or thousands of nodes, which can be difficult to configure manually, so a Hadoop cluster creates the need for tools to easily and effectively configure systems and data. HDFS and MapReduce might be executed on separate servers, named Hadoop clients, and security is the main reason for physically separating Hadoop nodes from Hadoop clients: if we decide to install clients on the same servers as Hadoop, we have to provide a high level of security for every user's access, while logically and physically separating them simplifies the needed configuration steps. There are many sub-projects (managed mostly by Apache, or made by free organizations) designed for maintenance and monitoring that integrate very well with Hadoop and let us concentrate on developing data ingestion rather than on monitoring it. Three of the tools commonly used for data ingestion in Hadoop are rigorously described in the following sections.

2.3 Kafka
Apache Kafka is a distributed, high-throughput messaging system, a publish-subscribe environment that provides high availability. With one broker handling hundreds of MB per second of reads and writes from several clients, Kafka is a very fast system. Messages are replicated across the cluster and then stored on disk. According to [7], "Kafka can be used for stream processing, web site activity tracking, metrics collection and monitoring, and log aggregation". "It is a paradox that a key feature of Kafka is its small number of features. It has far fewer features and is much less configurable than Flume", noted Ellen Friedman and Ted Dunning in [8] (Flume is described in the next section of this paper). They also observed that "Kafka is similar to Flume in that it streams messages, but Kafka is designed for a different purpose. While Flume is designed to stream messages to a sink such as HDFS or HBase, Kafka is designed for messages to be consumed by several applications". The principal elements of the Kafka architecture are the Producer, Broker, Consumer and Topic. Topics are used for feeding messages: producers send messages to topics, and consumers subscribe to those topics and consume the messages from them. Topics are partitioned, a key is attached to each message, and a partition is like a log. A minimal producer and consumer are sketched below.
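The following is a minimal sketch of the publish-subscribe flow using the Kafka Java client (the Kafka 2.x client API is assumed; the topic name "events" and the broker address are illustrative, not taken from the paper).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {

  public static void main(String[] args) {
    // Producer: publishes keyed messages to a topic.
    Properties p = new Properties();
    p.put("bootstrap.servers", "localhost:9092"); // illustrative broker address
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      // The message key decides the partition; each partition behaves like a log.
      producer.send(new ProducerRecord<>("events", "sensor-1", "temperature=21.5"));
    }

    // Consumer: subscribes to the topic and polls for new messages.
    Properties c = new Properties();
    c.put("bootstrap.servers", "localhost:9092");
    c.put("group.id", "ingestion-demo"); // consumers sharing a group split the partitions
    c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(Collections.singletonList("events"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> r : records) {
        System.out.printf("%s -> %s%n", r.key(), r.value());
      }
    }
  }
}
```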
2.4 Flume
"Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data" is the definition found on the official Apache Flume website. This paper analyses the newest version of each tool; the last stable version of Flume is 1.8.0, the eleventh release of the Apache project, which offers users a stable product, compatible with older versions of the Flume 1.x code line, and is software ready for production. The tool is made to ingest and collect huge volumes of data from multiple sources into the Hadoop Distributed File System (HDFS); the most frequently processed types of data are sensor and machine data, social media data, maps, astronomy, aerospace, application logs and geo-location data. One specific example of using Flume is the logging of manufacturing operations: a log is generated on every run of the product as it comes off the line, producing a file with information about the respective run. In a day the product runs thousands of times and generates a large volume of data stored in log files; using Flume, these data can be streamed into an analysis tool and then stored in HDFS. We remark that, in general, Flume allows users to ingest data from multiple sources and of different sizes and store it into Hadoop for future analyses; it scales horizontally to ingest data; the user has the guarantee that the data are delivered, based on the transactions between agents; it can insulate the system when the incoming data rate exceeds the standard rate; and it is more tightly integrated with the Hadoop ecosystem than Kafka or NiFi.
J. Kim and B. Bengfort note in [11] that data flows in Flume along a pathway which ingests data from origin to destination. Data or events are moved from source to destination based on a sequence of hops; the central concept is the Flume agent (a JVM process), which consists of three important components: the source, the channel and the sink. The source represents the part of the agent where data are received from data generators and then transferred to the channels as Flume events. The channel can work with different sources and sinks and represents a bridge between them: it receives the data from the source and buffers them until they are consumed by the sinks. The sink represents the final component of the Flume agent, where the data from the channels are consumed and sent to the destination (the data are stored in HDFS at this step); a sketch of such an agent configuration is given below. We note that the biggest disadvantage of using Flume is that data can be lost very easily; for example, if the user chooses the memory channel for high throughput, the data will be lost when the agent node goes down.
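Flume agents are wired together in a Java properties file. The following is a minimal sketch of the source-channel-sink structure described above; the agent and component names (a1, r1, c1, k1), the port and the HDFS path are illustrative assumptions, not taken from the paper.

```
# One agent (a1) with one source, one memory channel and one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
# (fast, but events are lost if the agent node goes down)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: writes the events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Starting the agent with a command such as `flume-ng agent --conf-file example.conf --name a1` would then stream anything written to port 44444 into HDFS.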

2.5 NiFi
Some systems generate data and other systems consume it; Apache NiFi was developed to automate this flow. In [16] Apache NiFi is defined as "a data flow management system that comes with a web User Interface that helps to build data flows in real time, it supports flow-based programming and the graph programming includes processors and connectors, instead of nodes and edges". The user connects processors together with connectors and defines how the data are to be manipulated. A strong feature is NiFi's capability of ingesting any data, using ingestion methodologies suited to each particular kind of data; the output is very customizable as well. T. John and P. Misra observed in [23] that "The Apache NiFi website states Apache NiFi as - An easy to use, powerful, and reliable system to process and distribute data". We consider it a good alternative to Apache Flume, having a vast set of features and an easy-to-use web user interface. It is easy to set up and it is very highly customizable. Data passing through a flow is represented as FlowFiles handled by processors, as the sketch below illustrates.
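Although NiFi flows are normally assembled in the web UI rather than in code, its processors are Java classes. The following is a minimal sketch of a custom processor built on NiFi's public processor API (the class and relationship names are illustrative). It simply passes every incoming FlowFile to a "success" relationship, which is the skeleton onto which real transformation logic would be added.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class PassThroughProcessor extends AbstractProcessor {

  // Connectors between processors are modeled as named relationships.
  public static final Relationship SUCCESS = new Relationship.Builder()
      .name("success")
      .description("All FlowFiles are routed here")
      .build();

  @Override
  public Set<Relationship> getRelationships() {
    return Collections.singleton(SUCCESS);
  }

  @Override
  public void onTrigger(ProcessContext context, ProcessSession session)
      throws ProcessException {
    // A FlowFile is NiFi's unit of data: content plus attributes.
    FlowFile flowFile = session.get();
    if (flowFile == null) {
      return;
    }
    // A real processor would read or transform the content here.
    session.transfer(flowFile, SUCCESS);
  }
}
```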
3. The research methods
This section contains information about the analysis of the main characteristics of the tools Flume, NiFi and Kafka, and represents our analysis of their functionalities and characteristics, with the aim of justifying their choice for the content of this paper. Our analysis is based on comparing them from different perspectives, such as performance, functionality and necessity. First, when we searched for information about data ingestion tools, we found over 30 different tools used by companies in this process, and we had to choose the best of them for our analysis. The choice of a data ingestion tool for a company depends on multiple factors such as target, transformations (simple or complex), data source, performance and necessity, so we used the same criteria in our research. Next, we searched for articles and opinions that contained key words like "data ingestion used tool" and "first option of data ingestion tool", and we obtained the main tools used for data ingestion. The final decision was based on [18], where, in a top-18 ranking of data ingestion tools, Flume is in second position, followed by Apache Kafka and Apache NiFi, the first option being Amazon Kinesis. Based on this top we decided that our paper would analyze three tools which represent a main choice for companies and users. We preferred to use in our research the Apache tools mentioned in this paper because an advantage of them is the possibility of combining them for a better result. Comparing them with the other tools from the data ingestion area, we noted that the tools analyzed in this paper have some special characteristics, such as the guarantee that they are reliable, offering zero data loss when using large volumes of data; the Apache Kafka and Flume systems are scalable, reliable and high-performance. Our decision was based on criteria which convinced us that, if put in a hypothetical situation of choosing a tool for data ingestion, we would use one of them. The last and perhaps most important criterion which helped us decide was the information from articles and books such as [16], [17] and [21], based on the use of these tools in well-known applications. We find that the main criteria when a company wants to choose a tool for data ingestion are: the speed of ingesting data; platform support, which offers the facility to connect with data stores; the facility to scale the framework to work with large datasets; and the facility to extract and access data from sources without impact on their ability to execute transactions or on performance. In our choice of NiFi, Kafka and Flume we used these criteria.

3.1 Functionalities of analyzed tools
An important part of the study is the analysis of tool functionalities. In this subsection we present the functionalities used for the tool comparison. We analyzed the following indicators: reliability, system requirements, limits of the tool, stream ingest and processing, guaranteed delivery and data type. In the following paragraph we justify our choice of these indicators; in the result section we examine the functionality of each tool using a standard scale with three values: completely implemented functionality, partially implemented functionality and not implemented functionality.


According to [25], "data collection and transportation should be reliable with minimum data loss". In our research, reliability was based on the ability to deliver events and to log them in situations when one of the tools has a software crash or scarce memory or bandwidth, and the possibility of losing data can appear. We consider the guaranteed delivery functionality very important because the user needs the guarantee that the data are complete after the data ingestion process. Limits of the tools were included in our analysis to expose their disadvantages. We want to note that, from this point of view, we consider that a perfect tool does not exist, and for companies a better solution can be to combine them. Another important functionality used in our investigation was the data type, because it is important for the user to know what types of data can be ingested with the chosen tool.
3.2 Performance measurements of the selected tools
Performance testing with big data includes two main actions: data ingestion and throughput, where it is verified how fast the system can consume data from various data sources, and data processing, where the speed with which map or reduce jobs execute queries is verified. We included performance measurements in our research because we consider that the user needs to know information about speed, the number of processed files per second, and the acceptable file size for each tool. In our analysis we relied on performance measurement studies published on the tools' official pages and on opinions found in articles from different users of Kafka, NiFi and Flume. In this paper we used the following performance indicators in our research: speed, number of files processed per second, scalability and message durability.
For businesses it can be challenging to ingest big data at a reasonable speed and to process it efficiently in order to maintain a competitive advantage, so the speed indicator needed to be included in our analysis. Another indicator included in our research was the number of files processed per second; we want to note that for our tools the number of files processed per second differed because of the size of the files. Scalability was included in our research based on what Cory Isaacson said in [5]: "When managing a successful expanding application, the ability to scale becomes a critical need. Whether you are introducing the latest new game, a highly popular mobile application, or an online analytics service, it is important to be able to accommodate rapid growth in traffic and data volume to keep your users happy." We consider this indicator an important one because data ingestion tools work with huge volumes of big data, and scalability can help in this process. Message durability was included in our research to make sure that even if the tool dies, the task is not lost. When one of the tools quits or crashes, the information about queues and messages is forgotten; to make sure that messages are not lost, it is necessary to mark both the messages and the queue as durable. In the result section we examine the performance indicators for each tool using a standard scale which can have three values: high, medium and low.
Below, we present a performance measurement study for Flume, published on Flume's official page; we used it in our research because it is the most explicit study found on this subject evidencing Flume's performance in the big data ingestion process. The test was made by M. Percy in [15], who used the following setup: the Flume agent was run in a single JVM on its own physical machine, and a separate client machine was used to generate load in syslog format against the Flume box. Data was stored by Flume onto a 9-node HDFS cluster configured on separate hardware; no virtual machines were used in this test. Regarding hardware specifications, the CPU was an Intel Xeon L5630, 2 x quad-core with Hyper-Threading @ 2133 MHz (8 physical cores), with 48 GB of memory and SuSE Linux 64-bit as the operating system. For the Flume configuration, Mike Percy used Java 1.6.0_26, one agent, the memory channel, and the HDFSEventSink sink with Avro event serialization and snappy compression; the event size used was 300 bytes. According to the obtained results, Flume is capable of assuring an approximate average performance of 70,000 events/second (about 21 MB/s at 300 bytes per event) on a single machine, without data loss during the test.
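The setup described above can be approximated with a Flume properties file like the following sketch; the agent and component names, port, capacities and HDFS path are illustrative assumptions, and Percy's exact test configuration is linked in [15].

```
perf.sources = r1
perf.channels = c1
perf.sinks = k1

# Syslog TCP source, matching the syslog-format load generator
perf.sources.r1.type = syslogtcp
perf.sources.r1.host = 0.0.0.0
perf.sources.r1.port = 5140
perf.sources.r1.channels = c1

# Memory channel, as in the test
perf.channels.c1.type = memory
perf.channels.c1.capacity = 100000
perf.channels.c1.transactionCapacity = 1000

# HDFS sink with Avro event serialization and snappy compression
perf.sinks.k1.type = hdfs
perf.sinks.k1.channel = c1
perf.sinks.k1.hdfs.path = hdfs://namenode/flume/perf
perf.sinks.k1.hdfs.fileType = CompressedStream
perf.sinks.k1.hdfs.codeC = snappy
perf.sinks.k1.serializer = avro_event
```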

4. The research results
4.1 The functionality comparison results
Based on the results obtained in our research, we compare the tools from the functionality point of view for a better analysis (see Table 1). According to the Wikipedia definition, Apache Flume "is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data"; in our research we found this functionality completely implemented, and Flume has a failover mechanism that can move the data flows to a new agent without external intervention. We consider that the best choice for this functionality is Kafka, because in the situation of a single point of failure the data remain available, in contrast with Flume, where the user cannot access events until the disk is recovered. On the other hand, NiFi is reliable by definition, but in the real world it is better to combine this tool with Kafka in order to use Kafka's reliable data stream storage. Flume and NiFi represent the main choices for guaranteed data delivery in comparison with Kafka, where delivery is not guaranteed in totality. Depending on the protocol, NiFi allows guaranteed delivery, with the mention that it supports at most once or at least once semantics, and Flume guarantees the delivery of events using a transactional approach.
All of them have a limitation for the "limits of the tool" functionality, because no single tool exists that addresses all requirements or does everything. In our analysis we obtained a list of limits for every tool. For Flume we found that when the Kafka channel is used the possibility of data loss appears, data size is limited to the KB range, and data replication does not exist. Kafka has the same problem as Flume with the data size (KB); it has a fixed protocol, format and schema, so custom code is often needed. The last one, NiFi, does not support data replication and is not an option for CEP or windowed computations. The data types for Flume are represented by the file formats SequenceFile, DataStream and CompressedStream; Kafka accepts data types like JSON, POJO or Java beans and, the fastest way, byte arrays; NiFi uses its own data object (the FlowFile). The conclusion for this functionality is that a tool that can process all types of data does not exist, and the user's decision depends on his needs. We consider that NiFi is the best option from the point of view of system requirements, because it can run on a laptop or be clustered across enterprise-class servers; the hardware and memory needed depend on the size of the data; it can run on Windows, Linux, Unix or Mac OS; it supports all types of browsers; and it requires Java 8 or newer. On the other hand, Flume can run only on Linux, requires Java 8 or newer, requires sufficient disk space for sinks and channels, and the agents need read/write permissions on directories. Kafka requires machines with a lot of memory, can run on Unix or Windows, and requires Java 8 or newer. Using Hadoop, Flume can be used to transfer, collect and aggregate streaming events because it is a distributed system, while its simple, flexible and intuitive programming model is based on streaming data flows; our analysis shows that Flume maintains a main list of ongoing data flows. In Kafka, messages are put into topics, which are split into partitions, and these are replicated across the nodes in the cluster; Kafka provides high-throughput persistent messaging that is scalable and allows parallel data loads into Hadoop. NiFi provides real-time control and, when run in a cluster, makes the movement of data between source and destination easier to manage. In conclusion, from the point of view of functionality, according to our analysis we consider that NiFi represents the best solution to use in a company.


Table 1. Functionality tools comparison

| Functionality                | Flume                  | NiFi                   | Kafka                  | Recommended tool       |
|------------------------------|------------------------|------------------------|------------------------|------------------------|
| Reliability                  | Partially implemented  | Partially implemented  | Completely implemented | Kafka                  |
| Guaranteed delivery          | Completely implemented | Completely implemented | Partially implemented  | Flume and NiFi         |
| Data type                    | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka  |
| System requirements          | Partially implemented  | Completely implemented | Partially implemented  | NiFi                   |
| Stream ingest and processing | Partially implemented  | Completely implemented | Partially implemented  | Flume, NiFi and Kafka  |
| Limits of the tool           | Partially implemented  | Partially implemented  | Partially implemented  | Flume, NiFi and Kafka  |

4.2 The performance comparison results
In Table 2 we summarize the performance indicators for Kafka, Flume and NiFi based on the results of our analysis. To determine the measures for the tools in our research, we relied on the functionality results, where we noted that Flume and Kafka have a limit on data size (KB); from this point of view, Flume and Kafka are the best options for the number of files processed per second, because the size of the files is small. The speed of the tools was set to medium for all of them, because this indicator does not have a standard limit; it depends on the needs of the user, and if the user wants to increase it, a tuning procedure can be used. According to [14], "Kafka is a general purpose publish-subscribe model messaging system, which offers strong durability, scalability and fault-tolerance support"; we consider this tool the best option for scalability in comparison with NiFi and Flume, which are distributed, reliable and available systems. Kafka is very scalable, and one of its key benefits is that adding a large number of consumers can be done easily, without downtime and without affecting performance, in comparison with Flume and NiFi, where this process cannot be done as easily (a consumer-group sketch is given below). For the last indicator we obtained that Flume represents one of the best choices because it supports the chaining of multiple interceptors and data flow models; with them, Flume makes event transforming and filtering very easy. Kafka supports synchronous and asynchronous replication, based on the durability requirement, and uses commodity hard drives. On the other hand, NiFi does not have native support for message processing, and in this case the tool needs to integrate with other event processing frameworks to complete the job. In conclusion, our choice from the point of view of the performance indicators is Kafka, because it obtained good results for processing, message durability and scalability.
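To illustrate why consumers can be added without downtime, the sketch below (assuming the same Kafka Java client and the illustrative topic "events" used earlier) starts an additional consumer in the same consumer group; Kafka then rebalances the topic's partitions among the group members automatically, with no broker reconfiguration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ScaleOutConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    // Same group.id as the already-running consumers: joining this group
    // triggers a rebalance that spreads the partitions over all members.
    props.put("group.id", "ingestion-demo");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events"));
      while (true) {
        // Each group member now polls only its assigned share of the partitions.
        consumer.poll(Duration.ofSeconds(1))
                .forEach(r -> System.out.println(r.partition() + ": " + r.value()));
      }
    }
  }
}
```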


Table 2. Performance indicators

| Performance indicator                | Flume  | NiFi   | Kafka  | The best choice  |
|--------------------------------------|--------|--------|--------|------------------|
| Speed                                | Medium | Medium | Medium | All of them      |
| Number of files processed per second | High   | Medium | High   | Flume and Kafka  |
| Scalability                          | Medium | Medium | High   | Kafka            |
| Message durability                   | High   | Low    | High   | Flume and Kafka  |

5. Conclusions
The complexity and volume of data generated by human and machine activity is increasing continuously. This paper presented an analysis of the use of big data ingestion and of the process of ingesting the variety, volume and veracity of Big Data.
After introducing the concept of data ingestion with Big Data and its necessity, and giving a short description of the Hadoop ecosystem and of three of the most widely used tools for big data ingestion, we examined the three Apache tools NiFi, Flume and Kafka in order to determine their common characteristics and to analyze the strong points of each one according to performance and functionality. This research showed that in terms of performance Kafka offers the best results, and in terms of functionality NiFi is the best option.

References
[1] A. Holmes, Hadoop in Practice, Second Edition, Manning, 2014
[2] M. Ali-ud-din Khan, M. Fahim Uddin, N. Gupta, "Seven V's of Big Data," in Proc. of Zone 1 Conference of the American Society for Engineering Education, 2014
[3] Apache Hadoop (2018). What Is Apache Hadoop? Retrieved from http://hadoop.apache.org/
[4] B. Bengfort, J. Kim, Data Analytics with Hadoop: An Introduction for Data Scientists, O'Reilly Media, 2016
[5] C. Isaacson, Understanding Big Data Scalability: Big Data Scalability Series, Part I, p. 30, Prentice Hall, 2014
[6] D. Zburivsky, Hadoop Cluster Deployment, 2013
[7] D. Vohra, Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools, Apress, 2016
[8] E. Friedman, T. Dunning, Real-World Hadoop, O'Reilly Media, 2015
[9] H. Shah, N. Sawant, Big Data Application Architecture Q&A: A Problem-Solution Approach, Apress, 2013
[10] J. Kreps, Putting Apache Kafka to Use: A Practical Guide to Building a Streaming Platform. Retrieved from https://www.confluent.io/blog/stream-data-platform-1/, 2015
[11] J. Kim, B. Bengfort, Data Analytics with Hadoop, O'Reilly Media, Inc., 2016
[12] J. Shan, Why Big Data is a big deal, 2014
[13] K. Noyes, "How Apache Kafka is greasing the wheels for big data," Computerworld, Nov. 1, 2015
[14] L. Jiang, Flume or Kafka for Real-Time Event Processing. Retrieved from https://www.linkedin.com/pulse/flume-kafka-real-time-event-processing-lan-jiang, 2015
[15] M. Percy, Flume NG Performance Measurements. Retrieved from https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Performance+Measurements, 2012
[16] N. Kumar, P. Shindgikar, Modern Big Data Processing with Hadoop, Packt Publishing, 2018
[17] N. Garg, Learning Apache Kafka, Second Edition, Packt Publishing, 2015


[18] Predictive Analytics Today (2017). Top 18 data ingestion tools. Retrieved from https://www.predictiveanalyticstoday.com/data-ingestion-tools/
[19] R. Manivannan, Using Kylo for Self-Service Data Ingestion, Cleansing, and Validation, 2017
[20] S. Ornes, Explainer: Understanding size data. Retrieved from https://www.sciencenewsforstudents.org/article/explainer-understanding-size-data, 2013
[21] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Second Edition, p. 56-70, Packt Publishing, 2015
[22] S. Teller, Hadoop Essentials, Packt Publishing, 2015
[23] T. John, P. Misra, Data Lake for Enterprises, Packt Publishing, 2017
[24] V. Mayer-Schönberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Eamon Dolan/Mariner Books, 272 p., ISBN: 978-0544227750, 2014
[25] Y. Yang, Data ingestion overview. Retrieved from https://medium.com/@Pinterest_Engineering/scalable-and-reliable-data-ingestion-at-pinterest-b921c2ee8754, 2017
[26] N. S. Gill, Data Ingestion, Processing and Architecture layers for Big Data and IoT. Retrieved from https://www.xenonstack.com/blog/big-data-engineering/ingestion-processing-big-data-iot-stream/, 2017
[27] Ingestion – Wikipedia, https://en.wikipedia.org/wiki/Ingestion

Andreea MĂTĂCUȚĂ graduated from the Faculty of Cybernetics, Statistics and Economic Informatics of the Bucharest University of Economic Studies in 2016. She completed the master's programme in Economic Informatics at the same faculty and received her master's diploma this year (2018). She currently works full-time as a Java programmer at Axway, having previously worked in C#. Her main focus is to learn more about programming and to become better in her domain. In her personal life she is passionate about technology, reading, pets and sport.

Cătălina POPA graduated from the Faculty of Electronics, Telecommunications and Information Technology of the Polytechnic University of Bucharest in 2016. In June 2018 she completed the master's programme in Economic Informatics at the Faculty of Economic Cybernetics, Statistics and Informatics of the Academy of Economic Studies. She is currently a Business Intelligence Application Administrator at Orange, in the IT Services Operations Department. Her work focuses on improving monitoring systems by adding and modifying Shell and PL/SQL scripts.
