The Faculty of Health, Science and Technology Computer Science

Pontus Sjöberg, Lina Vilhelmsson

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi

Bachelor’s Project 2020:06

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi

Pontus Sjöberg, Lina Vilhelmsson

© 2020 The author(s) and Karlstad University

This report is submitted in partial fulfillment of the requirements for the Bachelor’s degree in Computer Science. All material in this report which is not my own work has been identified and no material is included for which a degree has previously been conferred.

Pontus Sjöberg

Lina Vilhelmsson

Approved, June 01, 2020

Advisor: Prof. Andreas Kassler

Examiner: Per Hurtig


Abstract

In the last few years, the popularity of Industrial IoT has grown considerably, and it is expected to have an impact of over 14 trillion USD on the global economy by 2030. One application of Industrial IoT is using data pipelining tools to move raw data from industrial machines to data storage, where the data can be processed by analytical instruments to help optimize the industrial operations. This thesis analyzes and evaluates a data pipeline setup for Industrial IoT built with the tool Apache NiFi. A data flow setup was designed in NiFi which connected an SQL database, a file system, and a Kafka topic to a distributed file system. To evaluate the NiFi data pipeline setup, three tests were conducted to see how the system performed under different workloads. The first test determined which size FlowFiles should be merged into to achieve the lowest latency; the second tested whether data from the different data sources should be kept separate or merged together. The third test compared the NiFi setup with an alternative setup, which had a Kafka topic as an intermediary between NiFi and the endpoint. The first test showed that the lowest latency was achieved when merging FlowFiles together into 10 kB files. In the second test, merging together FlowFiles from all three sources gave a lower latency than keeping them separate for larger merging sizes. Finally, it was shown that there was no significant difference between the two test setups.

Acknowledgements

We want to thank our mentor at Karlstad University, Andreas Kassler, for helping us with writing our report and guiding us through the project. We also want to thank Erik Hallin, our mentor at Uddeholm AB, for helping and guiding us with all the different tools we used throughout the project, and for giving us insight into how our implementation might be used in an industry context. Lastly, we want to thank Uddeholm AB for letting us do this project for them.

Contents

1 Introduction 1

2 Background 3
2.1 Introduction ...... 3
2.2 Concepts ...... 3
2.2.1 Industrial Internet of Things ...... 3
2.2.2 Data Pipelining ...... 4
2.2.3 Data Streaming ...... 5
2.3 Apache Kafka ...... 6
2.3.1 Topics ...... 6
2.3.2 Cluster ...... 6
2.3.3 Producers ...... 7
2.3.4 Consumers ...... 7
2.4 Apache NiFi ...... 8
2.4.1 Primary Components ...... 9
2.4.2 Extensions ...... 11
2.4.3 Security ...... 12
2.4.4 Cluster ...... 13
2.4.5 Compatibility ...... 14
2.5 NiFi as a Producer and Consumer for Kafka ...... 14
2.5.1 MiNiFi ...... 14
2.5.2 NiFi as a Producer ...... 15
2.5.3 NiFi as a Consumer ...... 16
2.6 Related Tools ...... 16
2.6.1 Apache Airflow ...... 16
2.6.2 Apache Spark ...... 17
2.6.3 Apache Storm ...... 17
2.6.4 Azure Data Factory ...... 18
2.6.5 Logstash ...... 18

3 Data Pipelining Architecture and Prototype 19
3.1 Introduction ...... 19
3.2 Current Setup ...... 19
3.3 Why Bring in NiFi? ...... 21
3.4 New Pipelining Setups ...... 23
3.4.1 New Setup ...... 23
3.4.2 Alternative New Setup ...... 24
3.5 NiFi Processors Used ...... 26
3.5.1 Consuming from Kafka Topic ...... 27
3.5.2 Getting Files from File System ...... 27
3.5.3 Getting Data from MariaDB ...... 28
3.5.4 Other Processors ...... 29

4 Experimental Setup 31
4.1 Introduction ...... 31
4.2 Additional Software Used ...... 32
4.2.1 Apache Hadoop and HDFS ...... 32
4.2.2 MariaDB ...... 33
4.3 Compute Nodes ...... 34
4.3.1 Node 1 ...... 34
4.3.2 Node 2 ...... 35
4.3.3 Node 3 ...... 35
4.4 Experiment Description ...... 35
4.4.1 Performance Metrics ...... 35
4.4.2 Test Descriptions ...... 37

5 Results & Evaluation 40
5.1 Introduction ...... 40
5.2 Results of Test 1 ...... 40
5.3 Results of Test 2 ...... 44
5.4 Results of Test 3 ...... 46
5.5 Conclusion of Results ...... 48

6 Conclusions 50
6.1 Project Summary and Evaluation ...... 50
6.2 Future Work ...... 51

References 54

A Appendix 59
A.1 Python Script for Processing Kafka Messages ...... 59
A.2 SQL Script for Loading Rows into MariaDB ...... 59
A.3 Software Download Links ...... 59
A.4 Raw Data ...... 59
A.5 Pictures ...... 60

List of Figures

2.1 A simplified view of the Kafka architecture ...... 7
2.2 NiFi’s GUI ...... 9
2.3 A simplified view of the NiFi architecture ...... 9
2.4 A simplified view of a NiFi Cluster ...... 14
3.1 The current setup at Uddeholm AB ...... 20
3.2 New setup with NiFi sending data directly to HDFS ...... 24
3.3 Alternative new setup with NiFi sending data to HDFS through Kafka ...... 25
3.4 The processors used in NiFi for the new setup ...... 26
4.1 The three compute nodes used for the experiment ...... 34
4.2 NiFi data flow for the second test ...... 38
4.3 NiFi data flow for alternative new setup used for the third test ...... 39
5.1 Average FlowFile latency for different merging sizes ...... 40
5.2 Percentage of the total average FlowFile latency made up by the time between MariaDB and NiFi, before being sent to HDFS ...... 42
5.3 Average FlowFile latency for different merging sizes ...... 42
5.4 Latency distribution for different merging sizes ...... 42
5.5 Average FlowFile latency for different merging sizes, comparing separate and combined merging ...... 44
5.6 Latency distribution for combined and separate merging ...... 44
5.7 Average throughput for the two different sources with different amounts of 1 kB sources ...... 46
5.8 Average FlowFile latency for the two different setups ...... 47
5.9 Latency distribution for the two different setups ...... 47
A.1 Full-size version of Figure 3.4 ...... 60
A.2 Full-size version of Figure 4.2 ...... 61
A.3 Full-size version of Figure 4.3 ...... 62

List of Tables

4.1 Intervals for achieving different merging sizes ...... 37


1 Introduction

The Industrial Internet of Things, or Industrial IoT, is a subset of the Internet of Things (IoT) specific to industrial use, and it covers the machine-to-machine and industrial communication parts of IoT. [1] Industrial IoT has grown a lot in the past few years, and it is expected by some to have an impact on the global economy of over 14 trillion USD by the year 2030. [2] Industrial IoT focuses on integrating and interconnecting already existing devices, whereas ”consumer” IoT (e.g. smart devices) focuses more on creating new devices. An example of an Industrial IoT application is collecting large amounts of data from industrial machines and sending this data to various analytical tools, which can then optimize the industrial operations based on how the machines are currently performing. [1, 3] One way this can be done is by the use of data pipelining tools, which are tools for moving data from one place to another. [4]

In this project, we will evaluate data pipelining and data flows in the context of Industrial IoT by creating data pipelining setups in the tool Apache NiFi, and try to find the best way to include NiFi in an architecture where data needs to be moved from several starting points into a cloud-based file system. To evaluate this setup, we will also test the performance of the data flow setup in NiFi under different configurations and workloads. Currently, there are very few scientific papers available that test the performance of a NiFi data flow. [5, 6] Therefore, the result of this project is interesting to the task provider Uddeholm AB, as they are looking into using NiFi as part of their data streaming architecture. The specific tests performed in this project are designed with Uddeholm AB in mind, to answer the questions they have about the performance of a NiFi data flow setup.

The disposition of the report is as follows: In Chapter 2, some background to the technologies, concepts, and tools used is given. The technologies and concepts described in this chapter are Industrial IoT, data pipelining, and data streaming. The tools Apache Kafka and Apache NiFi are also described in detail in this chapter, along with shorter descriptions of some alternative data pipelining tools. Chapter 3 describes the data pipelining prototype designed and implemented in the project. The ideas behind the design and why Apache NiFi is used are explained. The new setup is described both in a more abstract view and more specifically by looking at how the NiFi flow was set up. Chapter 4 presents the experiments that were performed to test the NiFi implementation. A more detailed description of the setup is given, explaining the exact software that was used, and the setups for the three tests that were performed are described. Chapter 5 presents the results of the tests described in Chapter 4, along with evaluations of the results. Finally, Chapter 6 concludes the report, and possible future work is proposed there as well.

2 Background

2.1 Introduction

This chapter gives a background to the technologies and the main tools used in the project. The chapter starts with part 2.2, which goes through some of the important concepts and technologies that are relevant to the project. Part 2.3 gives a description of Apache Kafka, and part 2.4 describes Apache NiFi. Part 2.5 looks at the possibilities of combining NiFi and Kafka. Part 2.6 introduces some other related tools.

2.2 Concepts

2.2.1 Industrial Internet of Things

The Industrial Internet of Things, or the Industrial IoT, is a subset of the Internet of Things, IoT, which is an umbrella term for systems of interconnected computing devices and machines which can transfer data to each other without human-to-human or human-to-computer interaction. The popularity of IoT has increased in the last few years, with the rise of ”smart” consumer devices such as smartphones, the Apple Watch, and Amazon Echo. [7] This category of devices, while often colloquially called simply ”IoT” devices, is only a part of all of IoT, and is classified in [1] as consumer IoT, to distinguish it from Industrial IoT. The Industrial IoT subset covers the machine-to-machine and industrial communication parts of IoT. Industrial IoT is about connecting the domains of operational technology with information technology (IT) domains. For example, large amounts of data collected from industrial machines can be connected to analytical instruments which can help optimize the industrial operations. While consumer IoT focuses on creating new devices, Industrial IoT focuses more on already existing devices, and how to integrate and interconnect these with each other. There is also a difference in the amount of generated data: consumer IoT data volumes are dependent on the application, whereas Industrial IoT data is meant for analytics, which leads to Industrial IoT generating very large amounts of data, up to several terabytes every minute. [1, 3]

The architecture of Industrial IoT systems can be visualized as being built up of four layers. The topmost layer is the content layer, which encompasses the user interface of the system, such as a screen. The next layer is the service layer, which holds applications that can perform operations on data which then can be displayed in the content layer. Next is the network layer, which consists of physical network buses and communication protocols, which aggregate data and then transport it to the service layer. The bottom layer is the device layer, which contains the physical (hardware) components of the system, such as sensors, machines, and cyber-physical systems (CPS). [8]

2.2.2 Data Pipelining

Data pipelining is an implementation technique where instructions are carried out in parallel instead of one after the other. In a system which does not use pipelining, each item has to go through all of the instructions in the system before the next item can enter the system. When using pipelining, each item still has to go through all the necessary steps and instructions, but as soon as the first item moves from the first to the second instruction, the next item in line can go through the first instruction. By using pipelining, the latency for each individual item going through the system stays the same; however, the throughput of the system is increased. [9] Another aspect of pipelining is that the output of one process (or instruction) is the input of the next process in the pipeline. Just like with a real-life pipe built up out of multiple different segments, the data pipeline brings data from the input through multiple processes which all feed into the next process in line, until the data finally reaches the output point. [4]

The latter aspect of pipelining, where the output of one process becomes the input of the next process, is the main function that a software pipelining tool provides. There are many tools on the market for data pipelining, all with different strengths and weaknesses. [10, 11, 12, 13, 14] This thesis will focus primarily on Apache NiFi and secondarily on Apache Kafka, but some alternative tools are described in part 2.6. With the use of data pipelining tools, raw data can be moved from one system to another. Usually, this means moving data from sensors, databases, etc., and putting the data into a data lake, the cloud, or some other data storage solution. Here, the data will be stored and analyzed. This analysis can be performed with machine learning algorithms, which can notice patterns and anomalies in the data. This is an example of how data pipelining tools can be used for Industrial IoT. [1, 15]
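As an illustration of this principle, the following Python sketch (a hypothetical example, not part of the implementation described later in this thesis) chains two generator stages so that the output of one becomes the input of the next, and each item flows onward as soon as it is produced:

```python
def parse(lines):
    # First pipeline stage: turn raw CSV lines into (sensor, value) pairs.
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)

def threshold(records, limit):
    # Second pipeline stage: keep only readings above the limit.
    for sensor, value in records:
        if value > limit:
            yield sensor, value

raw = ["temp,21.5", "temp,99.1", "pressure,3.2"]
alerts = list(threshold(parse(raw), limit=50.0))
print(alerts)  # [('temp', 99.1)]
```

Because the stages are generators, an item can be consumed by the second stage before the first stage has seen the rest of the input, which is exactly the throughput benefit of pipelining described above.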

2.2.3 Data Streaming

Data streaming is the process of data being continuously generated by many different sources, where each data point usually is quite small (in the order of kilobytes). Streamed data can be processed with stream processing techniques, where each piece of data in the stream has operations done to it while the data is being streamed from one point to another. Streaming data is a good technique to use when the goal is to continuously analyze dynamic data and make decisions based on how the streamed data is behaving, for example with machine learning algorithms. [16] Stream processing can be compared to batch processing, which is a method where jobs or tasks are processed in batches, instead of one after the other as with stream processing. [17]

2.3 Apache Kafka

Apache Kafka is a distributed data streaming platform developed by LinkedIn in 2011. [18] Kafka uses publish-subscribe techniques to stream messages. In Kafka, the publishers are called producers, and the subscribers are known as consumers. The producers and consumers are connected through a server, in Kafka terms called a broker, which takes in the messages from the producers and sends them to the appropriate consumer(s). The messages are stored and organized in different topics within the broker. Producers send messages to the topics, and consumers read messages from the appropriate topic. [19, 20]

2.3.1 Topics

Topics are categories into which record streams are published in Kafka. Topics are multi-subscriber, which means that each topic can have zero, one, or more consumers that subscribe to the contents. Each topic has a partitioned log, which is an ordered sequence of records to which new records are appended. This partitioning allows a topic to have a log that is larger than what can fit on a single server. The records in each partition are given unique sequence IDs called offsets, to uniquely identify each record within a partition. All published records within a topic are stored for a specified amount of time, during which they can be read by consumers. This period of time is called the retention period. After this time is up, the records are discarded in order to free up space in the topic. [19]
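The offset mechanism can be illustrated with a small Python sketch (a toy model for illustration only; retention and replication are omitted):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where each
    record receives a monotonically increasing offset."""
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read_from(self, offset):
        # A consumer reads sequentially, starting at a given offset.
        return self._records[offset:]

partition = Partition()
print(partition.append("reading-1"))   # 0
print(partition.append("reading-2"))   # 1
print(partition.read_from(1))          # ['reading-2']
```

Because records are only ever appended, a consumer can resume from its last committed offset after a restart, which is the property the retention period relies on.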

2.3.2 Cluster

One or more brokers build up a Kafka cluster. The basic architecture of Kafka can be seen in figure 2.1. The brokers are managed by ZooKeeper, another Apache product, which also counts as part of the cluster. Apache ZooKeeper is a service for coordinating distributed systems like Apache Kafka, and it provides a shared hierarchical name space for coordination between distributed processes. [21]

Figure 2.1: A simplified view of the Kafka architecture

2.3.3 Producers

A producer in Kafka can publish data to whichever topic(s) they choose. The producer also chooses which record should be assigned to which partition within the chosen topic. This can be done in different ways, either using Round Robin to distribute the load evenly, or based on some semantic partitioning function (e.g. based on some key in the record). [19]
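These two partitioning strategies can be sketched in Python as follows (illustrative only; the function names are our own and not part of the Kafka API):

```python
from itertools import count

def round_robin_partitioner(num_partitions):
    # Spread records evenly across partitions, ignoring their content.
    counter = count()
    return lambda record: next(counter) % num_partitions

def key_partitioner(num_partitions):
    # Records with the same key always land in the same partition.
    return lambda record: hash(record["key"]) % num_partitions

rr = round_robin_partitioner(3)
print([rr({"key": k}) for k in "abcd"])  # [0, 1, 2, 0]

by_key = key_partitioner(3)
# The same key always maps to the same partition within a run:
assert by_key({"key": "machine-7"}) == by_key({"key": "machine-7"})
```

Key-based partitioning matters when per-source ordering must be preserved, since Kafka only guarantees ordering within a single partition.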

2.3.4 Consumers

A Kafka consumer can subscribe to one or more topics, and get the data that is published to these topics by the producers. Consumers are divided into consumer groups. Each record published to a topic is delivered to exactly one consumer instance within each subscribing consumer group; across groups, the record is broadcast, so every subscribing group receives its own copy. Commonly, the consumers are grouped together as ”logical subscribers”, i.e. all consumers which are subscribed to the same set of topics create a consumer group together. Each group will usually contain many consumer instances, which leads to higher fault tolerance. [19]
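This delivery semantics can be illustrated with a short Python sketch (a simplification for illustration; real Kafka assigns whole partitions, not individual records, to group members):

```python
from itertools import cycle

def deliver(record, groups):
    """Kafka-style fan-out: every consumer group receives the record,
    but only one member within each group handles it."""
    handled_by = {}
    for group_name, members in groups.items():
        handled_by[group_name] = next(members)  # round-robin within the group
    return handled_by

groups = {"analytics": cycle(["a1", "a2"]), "archiver": cycle(["b1"])}
print(deliver("rec-1", groups))  # {'analytics': 'a1', 'archiver': 'b1'}
print(deliver("rec-2", groups))  # {'analytics': 'a2', 'archiver': 'b1'}
```

Each group thus acts as one logical subscriber, while the work within a group is load-balanced across its instances.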

2.4 Apache NiFi

Apache NiFi is an open-source platform for managing data flows and data pipelines through a system. Apache NiFi started out as Niagarafiles, developed by the American National Security Agency (NSA) in 2006. NiFi was donated to the Apache Software Foundation (ASF) in 2014, and in 2015 it became a Top Level Project for ASF. The design of the NiFi platform is based on the flow-based programming (FBP) model. NiFi offers features that include the ability to run within clusters, security with TLS encryption, extensibility, and improved usability features. [22] NiFi was created with IoT in mind, and is often used in an Industrial IoT context. [23, 24, 25, 26, 27] NiFi’s core is a data flow that is built on the main concepts of FlowFiles, FlowFile Processors, and Connections. FlowFiles represent each object moving through the data flow. Each FlowFile has a UUID (Universally Unique Identifier), a file name, and a size, and can also have content, although this is not necessary. FlowFile Processors are where the actual work in NiFi is performed. The processors get access to FlowFiles, their attributes, and their content. Processors are the building blocks of NiFi data flows, and they are the most used component in NiFi. Processors can be used, for example, to create, delete, modify, or inspect FlowFiles before sending them to the next processor or to an endpoint outside NiFi. Processors can be grouped together into Process Groups, which are similar to subnets in the FBP model. Each process group is a set of processors and their connections which can receive and send data through input and output ports. By using process groups, creation of new components is done simply by combining other components. Connections provide linkage between processors, and they act as queues which in turn allow processors to interact at different rates. [23]

NiFi is run through a graphical user interface (GUI) which makes it easy to visualize the flow components (see figure 2.2), unlike some other similar tools (such as Apache Kafka) which only use the command line interface. The GUI uses drag-and-drop techniques for building up the data flow with processors and connections.

Figure 2.2: NiFi’s GUI

Figure 2.3: A simplified view of the NiFi architecture

2.4.1 Primary Components

NiFi is a Java program that runs on a Java Virtual Machine (JVM) on the hosting server. NiFi’s primary components on the JVM are the Web Server, the Flow Controller, Extensions (or extension points), the FlowFile Repository, the Content Repository, and the Provenance Repository (see figure 2.3). The Web Server hosts NiFi’s HTTP-based command and control API. The Flow Controller is the component that provides threads for extensions to run on.

It manages when extensions receive resources and when they execute. The Flow Controller acts as a broker between processors, facilitating the exchange of FlowFiles. The extensions will be described in further detail in part 2.4.2.

The FlowFile Repository is the component that allows NiFi to keep track of the state of what is known about the FlowFiles that are currently active in the data flow. It acts as a Write-Ahead Log of the metadata of each FlowFile that currently exists in the system. This metadata includes the attributes of a FlowFile, a pointer to the content of the FlowFile (in the Content Repository, described later in this section), and the state of the FlowFile. Each change to a FlowFile is logged in this repository before it is executed. With this information in the Write-Ahead Log, NiFi is able to handle restarts and unexpected system failures. NiFi resumes where it was stopped by checking the Write-Ahead Log along with the snapshot of the FlowFile in the Provenance Repository. The default approach of the FlowFile Repository is to have the Write-Ahead Log on a specified disk partition, but the repository itself is pluggable (alternative implementations can be swapped in).
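The recovery idea behind a Write-Ahead Log can be sketched in a few lines of Python (a toy model for illustration only; NiFi’s actual implementation is considerably more involved):

```python
class FlowFileRepository:
    """Toy write-ahead log: every attribute change is logged before it is
    applied, so state can be rebuilt after a crash by replaying the log."""
    def __init__(self):
        self.log = []    # durable, append-only record of intended changes
        self.state = {}  # in-memory FlowFile metadata

    def update(self, flowfile_id, attributes):
        self.log.append((flowfile_id, attributes))  # write to the log first...
        self.state[flowfile_id] = attributes        # ...then apply the change

    def recover(self):
        # After a restart, replay the log to reconstruct the last known state.
        recovered = {}
        for flowfile_id, attributes in self.log:
            recovered[flowfile_id] = attributes
        return recovered

repo = FlowFileRepository()
repo.update("ff-1", {"filename": "sensor.csv", "size": 1024})
assert repo.recover() == repo.state
```

Because the log entry is written before the in-memory state changes, a crash between the two steps loses nothing: replaying the log always reproduces a consistent view.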

The actual content of a FlowFile which is currently in the NiFi flow is located in the Content Repository. This content is stored locally on disk and is only read into JVM memory when needed. By doing this, the producer and consumer processors do not need to hold on to objects in memory. When the content of a FlowFile is no longer in use, the content is deleted or archived. Archiving of content can be enabled or disabled in the nifi.properties file. If the content is archived, it remains in the Content Repository until a maximum archiving time, also set in nifi.properties, has passed. If the Content Repository is taking up too much space, content that is archived or marked as no longer in use is deleted; this behavior is likewise configured in nifi.properties. The Content Repository is also pluggable, but by default it is implemented as a simple mechanism where blocks of data are stored in the file system. To help reduce contention on any single volume, more than one file storage location can be specified to get different physical partitions.

In the Provenance Repository, NiFi stores all the provenance event data (metadata). Provenance is a record of all that has happened to a FlowFile, i.e. the history of the FlowFile. A new provenance event is created every time an event occurs to a FlowFile. These provenance events give a snapshot of the FlowFile at that time. All of the attributes and pointers to the FlowFile’s content are copied and stored in the provenance event, along with the state of the FlowFile. The state can include information about the FlowFile’s relationship with other provenance events, among other things. The provenance events are stored in the Provenance Repository until they expire. The time until they expire is specified in the nifi.properties file. The repository is pluggable, but by default it is implemented in such a way that it uses one or more physical disk volumes. Event data is indexed and searchable within each location. [23, 28]

2.4.2 Extensions

As mentioned previously, NiFi has a number of extension points. These extension points give developers the ability to add features to the platform so that NiFi meets their needs. The most common extension points, along with the already described processors, are the ReportingTask, the ControllerService, the FlowFilePrioritizer, and the AuthorityProvider. [29]

The ReportingTask interface is the mechanism that allows NiFi to publish metrics, monitoring information and internal NiFi states to endpoints. Endpoints can be log files, e-mail, or remote web services, for example.

The ControllerService gives shared state and functionality to processors, other ControllerServices, and ReportingTasks within a single JVM. This could for example be used when loading a large data set into memory. By using a ControllerService, the data set can be loaded once and shared with all processors, instead of each processor loading the data set individually. The ControllerService is also used if there is a need to establish a connection to an external server.

The FlowFilePrioritizer interface provides prioritizing and sorting for FlowFiles that are placed in a queue so that they can be processed in the order that is most effective for that use case.

The AuthorityProvider is used for determining privileges and roles for a given user, if there are any.

2.4.3 Security

NiFi can ensure a secure connection by using SSL, SSH, HTTPS, and encryption of content. [30] NiFi uses two-way SSL in the data flows from system to system but also from user to system. In system-to-system cases NiFi uses two-way SSL in every exchange of the data flow by encrypting and decrypting through shared keys on each side for both the sender and the recipient. For user-to-system security, NiFi uses two-way SSL authentication to be able to authorize a user and control the correct level of access for the user (read-only, data flow manager, or admin). [23] NiFi also supports authenticating users with a login and password instead of the built-in SSL client certificate authentication. This is done by a ”Login Identity Provider”, which is a pluggable mechanism for authenticating users with login and password. Which provider to use is configured in the nifi.properties file. The authentication can currently be done through Apache Knox, Kerberos, OpenID Connect, or the Lightweight Directory Access Protocol (LDAP). [31] NiFi also provides Multi-Tenant Authorization, which means that the authority level of a data flow is applied to each component of that data flow. Once authenticated, users are split up into groups of users (tenants) that can command, control, or observe the data flow with different levels of authorization. If a user tries to view or modify a NiFi resource, the system checks whether the user has privileges to perform that action. The privileges are determined by policies, and these policies are managed by an authorizer. The authorizers are configured with two properties in the nifi.properties file: the nifi.authorizer.configuration.file property and the nifi.security.user.authorizer property.

The first property specifies the configuration file, and it is here that the authorizers are defined. By default the configuration is set so that the authorizers.xml file is selected. The authorizers.xml file is used to configure and define authorizers; the default authorizer is the StandardManagedAuthorizer. This authorizer contains a UserGroupProvider and an AccessPolicyProvider. These providers are used to load and optionally configure the users, groups, and access policies. The StandardManagedAuthorizer makes all of its access decisions based on these policies.

The second property indicates which of the authorizers configured in the authorizers.xml file should be used. [31, 32]
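As an example, the relevant part of nifi.properties could look as follows (the values shown are illustrative defaults, not taken from the setup described in this thesis):

```properties
# Which file defines the available authorizers (points at authorizers.xml by default)
nifi.authorizer.configuration.file=./conf/authorizers.xml
# Which of the configured authorizers NiFi should actually use
nifi.security.user.authorizer=managed-authorizer
```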

2.4.4 Cluster

Like Apache Kafka, NiFi can operate within a cluster, and NiFi does this by using a Zero-Master Clustering paradigm. Zero-Master Clustering entails that each node in the cluster performs the same tasks but on different sets of data. NiFi uses an embedded ZooKeeper, see figure 2.4, which chooses a node as the Cluster Coordinator. The Cluster Coordinator is in turn responsible for connecting and disconnecting nodes. Each node in the cluster reports its heartbeat and status to the Cluster Coordinator, and the Cluster Coordinator disconnects a node if it does not report any heartbeat status for a set amount of time. Each cluster also has a Primary Node, which is able to run isolated processes: while the rest of the nodes run all of the tasks in the data flow, the Primary Node can run a process in isolation. It is possible to configure in a processor whether it should execute on all nodes in the cluster or only on the Primary Node. Any potential fail-over is handled automatically by ZooKeeper. [29, 31]

Figure 2.4: A simplified view of a NiFi Cluster

2.4.5 Compatibility

Through its processors, NiFi is compatible with many other tools. NiFi can work against several different databases, both SQL (like InfluxDB and MariaDB) and NoSQL (MongoDB, Couchbase, DynamoDB, and HBase). NiFi can read from and write to both regular file systems (remotely through SSH as well as locally) and the distributed file system HDFS. [23] As described in part 2.5, NiFi can use Kafka as both a sink and source for data.

2.5 NiFi as a Producer and Consumer for Kafka

NiFi can work as both a producer and a consumer for Kafka by implementing the processors PublishKafka and ConsumeKafka respectively. The NiFi sub-project MiNiFi can be used as an alternative to just using PublishKafka. [33]

2.5.1 MiNiFi

MiNiFi is a sub-project of Apache NiFi. Whereas NiFi is implemented in a data center, MiNiFi is implemented at the ”edge”, i.e. close to where the data is created at sensors or an IoT implementation. MiNiFi does not have its own UI; instead, the flows are created in NiFi and then exported into MiNiFi. MiNiFi is much smaller than NiFi, and takes up less than 100 MB of space, whereas NiFi needs several GB of space. [34]

2.5.2 NiFi as a Producer

NiFi as a producer takes a FlowFile as input and forwards it to a topic at a Kafka broker. The main way to have NiFi as a Kafka producer is by using the PublishKafka processor. PublishKafka takes the contents of a FlowFile and puts them into a Kafka topic using the KafkaProducer API. The contents of the FlowFile are converted to a Kafka message. The PublishKafka processor includes an optional Message Demarcator. The demarcator is used to decide where the separation between messages should be, and in this way it determines whether the contents of a FlowFile should be sent as several messages or as one large message. One large message is the default approach if the Message Demarcator is not set. PublishKafka distributes the messages to a Kafka topic following the Round Robin principle between partitions, depending on the number of partitions. If some messages for a given FlowFile fail to send but some are successfully sent, the entire FlowFile is considered a failed FlowFile. The failed FlowFile will have a separate attribute set that indexes the last message successfully ACKed by Kafka. This attribute allows PublishKafka to re-send only the messages that have not been ACKed by Kafka. [33, 35]
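The demarcator’s effect on the publishing side can be summarized with a small Python sketch (illustrative only; the function name is ours and not a NiFi API):

```python
def flowfile_to_messages(content, demarcator=None):
    """Mimic PublishKafka's Message Demarcator: with a demarcator the
    FlowFile content is split into several Kafka messages; without one,
    the whole content becomes a single message."""
    if demarcator is None:
        return [content]
    return content.split(demarcator)

print(flowfile_to_messages(b"r1\nr2\nr3", demarcator=b"\n"))  # [b'r1', b'r2', b'r3']
print(flowfile_to_messages(b"r1\nr2\nr3"))                    # [b'r1\nr2\nr3']
```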

It is also possible to implement the PublishKafka processor in MiNiFi. This can be used to get data more directly from the source to Kafka, instead of having it go through the central NiFi instance.

Lastly, MiNiFi and NiFi can be combined, where MiNiFi delivers the data from the source, and NiFi then uses the PublishKafka processor to publish messages to a Kafka topic. [33]

2.5.3 NiFi as a Consumer

NiFi can also act as a Kafka consumer. In this case, the NiFi processor ConsumeKafka replaces the Kafka consumer and handles all the data from the chosen Kafka topic(s), delivering it to where it needs to go. No code needs to be written to implement this; the ConsumeKafka processor is simply dragged and dropped into the NiFi workflow. When Kafka sends a message to NiFi, the ConsumeKafka processor emits a FlowFile where the content of the FlowFile is the content of the Kafka message. The ConsumeKafka processor also includes an optional Message Demarcator. Unlike in the PublishKafka case, the demarcator in the ConsumeKafka processor indicates that all of the messages received in a single poll should be produced as one FlowFile, with the demarcator separating the messages from each other. If the demarcator property is left blank, ConsumeKafka will produce one FlowFile for each message received. [33, 36]
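The consuming side mirrors the publishing behavior, as the following sketch shows (again illustrative only, not NiFi code):

```python
def poll_to_flowfiles(messages, demarcator=None):
    """Mimic ConsumeKafka's Message Demarcator: with a demarcator, all
    messages from one poll are joined into a single FlowFile; without
    one, each message becomes its own FlowFile."""
    if demarcator is None:
        return list(messages)
    return [demarcator.join(messages)]

print(poll_to_flowfiles([b"m1", b"m2"], demarcator=b"\n"))  # [b'm1\nm2']
print(poll_to_flowfiles([b"m1", b"m2"]))                    # [b'm1', b'm2']
```

Joining many small messages into one FlowFile reduces the number of FlowFiles NiFi must track, which is closely related to the merging-size trade-off examined in the tests later in this thesis.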

2.6 Related Tools

There are several different tools, other than Apache Kafka, that serve functions similar to Apache NiFi's. This section describes some of these tools.

2.6.1 Apache Airflow

Apache Airflow is a platform for creating and managing workflows through Python scripts. Airflow displays the workflows as Directed Acyclic Graphs (DAGs) of tasks, which can be easily modified through command line utilities. Since Airflow uses Python as a programming language, libraries and classes can easily be imported for easier creation and management of workflows. Airflow also has a web UI which provides insight into the logs and status of tasks in the workflow. Airflow has many built-in integrations with tools such as Apache Hive, Spark, HDFS, and MySQL. Apache Kafka is not among these integrations; however, there are still ways to connect the two platforms with a bit of work. Overall, it is possible to do a lot with Airflow, but users are required to write the code for their workflows themselves. Airflow is not classified as a data streaming tool, since the tasks do not send data between each other. [10]

2.6.2 Apache Spark

Apache Spark is a cluster computing system, which provides high-level APIs in several different programming languages, and supports general execution graphs. Spark provides many libraries, which allow for work with SQL, machine learning (ML), streaming, and more. These different libraries can also easily be combined in Spark applications. The Spark API has an extension called Spark Streaming, which is specially created for scalable, high-throughput, and fault-tolerant stream processing of data streams. Spark Streaming can ingest data from Kafka, Twitter, TCP sockets, and more, and can push the data to file systems such as HDFS, databases, and dashboards. Spark Streaming has no UI, and programs are written in Scala, Java, or Python. [11, 37]

2.6.3 Apache Storm

Apache Storm is used for real-time processing of data streams. Storm uses vertices in the form of "spouts" and "bolts", and edges in the form of "streams" to visualize DAGs. Spouts are the sources of the streams, and will generally read from a queuing system such as Apache Kafka. It is also possible to configure a spout to create its own stream, or to read from something such as a Twitter streaming API. Bolts then process any number of input streams and produce output streams; most of the computation logic in Storm happens inside the bolts. The bolts can communicate with any database system. Storm can be used with any programming language, and it can reliably process unbounded streams of data. Use cases for Apache Storm include ETL (Extract, Transform, Load), real-time analytics, and online ML. [12, 38]

2.6.4 Azure Data Factory

Azure Data Factory is a Microsoft service designed to integrate different data sources. It is a platform for managing data in the cloud and on-premises, and is used to integrate data from storage systems into data-driven workflows for ETL and ELT (Extract, Load, Transform). Data Factory is mainly used to load data into Microsoft's own Azure SQL databases. Data Factory has no support for running user-written code, and it lacks the ability to add custom processes, so users are limited to the tools available within Data Factory. Unlike the other tools presented, no free version of Data Factory is available. The price of Data Factory depends on how many pipelines are orchestrated and executed, on data flow executions and debugging, and on the number of Data Factory operations used. [13, 39]

2.6.5 Logstash

Logstash is an open-source data processing pipeline from Elastic. Logstash ingests data from a source (such as Kafka, a file, or GitHub, to mention a few), transforms it in the way you want, and then transports the data (or events) to a "stash". There are many different stashes, for example sending the events to a TCP socket, publishing the data to a websocket, writing events to a Kafka topic, or writing the events to a file on disk. Logstash can be used as a pipelining tool, but it is mainly used for storing and managing logs and events. Logstash is run through the terminal and configurations are written in bash. It is categorized as a log management tool, whereas NiFi is categorized as a data streaming tool. [14, 40]

3 Data Pipelining Architecture and Prototype

3.1 Introduction

This chapter describes how we implemented a prototype for data pipelining in Apache NiFi. In part 3.2, a setup which does not use NiFi is described. Part 3.3 goes over the advantages of using NiFi in the pipeline, and part 3.4 shows the experimental setups which will be implemented in NiFi. Lastly, part 3.5 shows what our data pipeline looks like in NiFi, and describes the different processors used to build up the pipeline.

3.2 Current Setup

The goal for this project is to evaluate a new setup for data pipelining that can be used in an Industrial IoT context. In this part, the setup that is currently used at the task provider, Uddeholm AB, is described. This is a setup which only uses Kafka as a tool to transfer raw data from the source to various data sinks. This setup will from here on be referred to as the "old setup".

Data collected by sensors in the industrial machines is transferred using either only PLC4x or a combination of a traditional PLC and the data acquisition software iba (more specifically the tool ibaDatCoordinator). This data is sent to the internal Kafka cluster, which is containerized via Docker and orchestrated with Kubernetes. There are three Kafka consumers: a web server, a cloud service (AWS), and the production system. The SCADA system Ignition is used mainly for visualization. A model of the setup can be seen in figure 3.1.

Figure 3.1: The current setup at Uddeholm AB

PLC - A PLC, or programmable logic controller, is an industrial computer which has been adapted in order to be used for automation, for example in an assembly line or in robotic devices. [41]

PLC4x - PLC4x is an open-source tool from Apache. It is a universal protocol adapter for industrial IoT. PLC4x has many built-in integration possibilities, of which Apache NiFi and Apache Kafka are two examples. [42]

iba - iba is a system for acquiring and analysing process data. The tool ibaDatCoordinator is used for automatic processing and managing of measurements. ibaDatCoordinator provides automatic generation of fault and quality reports and integrated status monitoring, and it notifies when set thresholds have been met. [43, 44]

Docker - Docker is a product for containerizing apps. Via Docker, it is possible to download software in containers, and these are hosted on the Docker Engine. The Docker Engine supports all different types of applications, and it works on several different operating systems. [45, 46]

Kubernetes - Kubernetes is an open-source system that can be used to group containers into logical units to simplify management. Kubernetes can help manage the containers that run applications, and make sure that downtime is minimized. [47]

Ignition - Ignition is a platform for building and deploying industrial applications. The Ignition software can act as a hub for all systems on the plant floor, which allows for complete system integration. With Ignition, data from databases can be illustrated in the form of graphs and tables. [48]

3.3 Why Bring in NiFi?

The main difference between the old setup and the proposed new setup (described in part 3.4) is that NiFi is brought in as the main data flow manager, partly replacing Kafka. In this part, these two tools will be compared.

Both NiFi and Kafka are able to move data from one node to another; however, NiFi is classified as a data pipelining tool, whereas Kafka is classified as a distributed streaming platform. When comparing NiFi and Kafka, one of the most obvious differences is that NiFi uses a web-based GUI, without the need for the user to write any code if they do not want to. Kafka does not have a GUI, and the user needs to write their own code to set up consumers, producers, and topics. This makes NiFi much more user-friendly, and users who do not have a lot of programming experience can find it much more accessible than Kafka. Thanks to NiFi's GUI, it is also easy to oversee the workflow, something that is harder to do in a similar way in Kafka. Another difference is that Kafka is made for smaller messages, whereas NiFi is made for larger messages and streams. Performance-wise, it is faster for NiFi to create a single FlowFile out of one million messages and send that, instead of sending one million one-message FlowFiles. [33]

NiFi and Kafka are quite equal when it comes to the level of security they can provide. As described in section 2.4.3, NiFi has several ways to provide secure connections when sending data through the pipelines; one of these is two-way SSL authentication. Kafka can also provide security in several ways since the 0.9.0.0 update. These security measures include securing connections to brokers through SSL, SASL, or SASL/PLAIN authentication; encryption of data sent between brokers and clients, tools, or other brokers through SSL; and authorization of read/write operations done by clients. The Kafka cluster can also be protected by requiring users to connect to it over SSH with a password. These security measures are all optional, which is also the case in NiFi. [49] Another way to provide security in Kafka, and the way it is done in the old setup, is by using an Avro schema. The Avro schema can minimize and compress the data, and the data can then only be read by someone who has the schema. If the schema is placed at a different endpoint than the data, this provides encryption of the data. [50]

In Kafka, all nodes contain metadata about which servers are currently alive and where the partition leaders of a topic are. They can share this metadata with a producer as an answer to a request, and the producer in turn can send its data directly to the broker that is the leader for the partition. [51] NiFi stores all FlowFile metadata in the FlowFile Repository and all provenance event data in the Provenance Repository, as explained in section 2.4.1. It is possible to get metadata about sent messages in Kafka, but it is trickier and not as easy as in NiFi. In this aspect, NiFi has a clear advantage over Kafka.

3.4 New Pipelining Setups

A new setup was designed as an alternative to the old setup described in part 3.2. The idea behind this setup is to insert NiFi between Kafka and the sink for the data, and also to add a database and a regular file system as data sources. The method for moving data from machines via PLCs to Kafka stays the same in the new setup; however, for the test simulations, fake data will be generated in Kafka. Furthermore, the new design has only one data sink, instead of the three data sinks seen in figure 3.1. This data sink is a distributed file system, more specifically the Hadoop Distributed File System, HDFS. According to Erik Hallin at Uddeholm AB, the advantage of using HDFS is that it is more scalable than the current data sinks. An alternative setup was also designed, similar to the new setup but with Kafka inserted between NiFi and the distributed file system. However, the main focus of the testing in this project will be on the new setup, not the alternative new setup.

3.4.1 New Setup

The new setup uses NiFi alone for pipelining data from the data sources to the HDFS endpoint, as seen in figure 3.2. NiFi acts as a Kafka consumer for a Kafka topic and forwards Kafka messages, along with data from a database and files from a remote file system, to HDFS. The Kafka producer publishes messages to a topic by use of the kafka-producer-perf-test.sh script, which generates messages to a specified topic, with a specified number of records (messages), record size, and throughput, using a producer configuration file. This is used to simulate data coming in from PLCs and being published to a Kafka topic. To make sure that NiFi can find the Kafka broker, which is on a different machine, the variable advertised.listeners in the Kafka configuration file server.properties is set to PLAINTEXT://[kafka machine ip]:9092. Other than this, there is no need to configure any Kafka files in any special way; the default values work for the purposes of this experiment.

Figure 3.2: New setup with NiFi sending data directly to HDFS

For Hadoop, the file core-site.xml is edited to add a configuration of the property fs.defaultFS to be hdfs://[hdfs machine ip]:9000, and the file hdfs-site.xml is edited to configure the property dfs.replication to be 1 and the property dfs.permissions.enabled to be false. The latter is done to turn off permission checks, which enables NiFi to connect to HDFS. The files hdfs-site.xml and core-site.xml are copied and added to the machine which hosts NiFi, since these are needed to connect NiFi and HDFS.
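Expressed as property entries, the Hadoop edits above look roughly as follows (the IP placeholder is kept from the text; the surrounding file contents are omitted):

```xml
<!-- core-site.xml: default file system URI for HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://[hdfs machine ip]:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single replica, permission checks off -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
```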

3.4.2 Alternative New Setup

The alternative new setup is meant to test how well it works to send the data from NiFi to a second Kafka topic before sending it to the endpoint in HDFS, as seen in figure 3.3. This could be interesting if one already has a connection set up between Kafka and Hadoop. Kafka and HDFS are configured as in the main new setup, and the NiFi flow is built up the same way, apart from the final step where the data is sent to a Kafka topic instead of an HDFS directory.

Figure 3.3: Alternative new setup with NiFi sending data to HDFS through Kafka

To send the data from Kafka to HDFS, a Python script is used, which uses the kafka-python library to read messages from a topic. [52] The script then appends the contents of each Kafka message, along with a timestamp of the current time (expressed in the Unix timestamp format), to a local file. This local file then gets uploaded to HDFS. This method of sending one larger file instead of many small files is used to work around the long time (around 2 seconds) it takes to connect to HDFS for each file sent. The Python script can be found in Appendix A.1.
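The actual script is listed in Appendix A.1; the following is only a condensed sketch of the approach. The function names, the record format, and the use of the hdfs CLI for the upload are our assumptions, not taken from the appendix:

```python
import subprocess
import time

def tag_with_timestamp(message_bytes):
    """Append a Unix timestamp in milliseconds to one Kafka message,
    producing one line of the buffered local file (format assumed)."""
    ts = int(time.time() * 1000)
    return message_bytes.decode() + "," + str(ts) + "\n"

def drain_topic(topic, local_path, hdfs_dir, max_messages=1000):
    """Read messages from `topic`, buffer them in a single local file,
    then upload that one file to HDFS, so only one HDFS connection
    (~2 s) is paid instead of one per message. Requires the
    kafka-python package and the hdfs CLI on the PATH."""
    from kafka import KafkaConsumer  # assumed: kafka-python installed
    consumer = KafkaConsumer(topic,
                             bootstrap_servers="[kafka machine ip]:9092")
    with open(local_path, "w") as out:
        for i, msg in enumerate(consumer):
            out.write(tag_with_timestamp(msg.value))
            if i + 1 >= max_messages:
                break
    subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir],
                   check=True)
```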

3.5 NiFi Processors Used

The NiFi processors that are used to create the new setups are described here. For some of the tests the NiFi flow looks slightly different, but all processors used across the tests, except one (PublishKafka_2_0), can be seen in figure 3.4 (a full-size picture can be found in Appendix A.5). The queue sizes for the connections between the processors were increased to hold 500 000 FlowFiles, to try to make sure that the connections would not influence the results. If a queue fills up, back pressure is applied to the previous processor, which slows down the flow.

Figure 3.4: The processors used in NiFi for the new setup. In the green box are processors for consuming from Kafka topic, the yellow box is for getting files from the remote file system, the red box is for pulling rows from the database. The blue box is for clearing out HDFS, and the pink box is for putting files to the remote file system.

3.5.1 Consuming from Kafka Topic

ConsumeKafka_2_0 is used to consume messages from a Kafka topic. In this processor's properties, the IP address and port of the Kafka broker are set, along with the name of the topic to consume data from. Using specifically ConsumeKafka_2_0 is necessary for compatibility with the Kafka system, since it is of version 2.x. For earlier versions of Kafka, there are other versions of the ConsumeKafka processor.

MergeContent is used to be able to merge several FlowFiles together for different setups and configurations of the data flow. This processor is used for all three data sources to test merging of FlowFiles. The properties Minimum Number of Entries and Maximum Number of Entries are both set to 1 when no merging is to happen, along with a Minimum Group Size of 0 bytes and no Maximum Group Size set. These values are changed to allow for merging when this is tested. The Maximum and Minimum Group Sizes were mainly used to set size intervals for the merged FlowFiles.

PutHDFS is used for all three data sources. This processor needs the paths to the files core-site.xml and hdfs-site.xml, which have been copied to the machine on which NiFi is hosted. PutHDFS also needs the HDFS directory where the files should be put.

3.5.2 Getting Files from File System

GetSFTP is the processor which gets files from a remote file system hosted on a different machine than NiFi. This processor uses SFTP (SSH File Transfer Protocol) to connect to the remote file system. To do this, the processor needs the IP address of the remote node, the SSH port (22), and a username with a corresponding SSH key to be able to access the file system, which is in a defined directory. The run schedule of this processor is set to 0.01 seconds, meaning that GetSFTP will pull files from the file system 100 times per second. The property Max Selects is set to 5 000, which means that this is the maximum number of files to be pulled in a single connection.

MergeContent is configured as described in 3.5.1.

PutHDFS is configured as described in 3.5.1.

3.5.3 Getting Data from MariaDB

When gathering data from MariaDB to HDFS, NiFi needs to use more processors than for the other data sources, as seen in figure 3.4. This is because SQL data needs to be handled in another way than the other data sources. The first processor, QueryDatabaseTable, gathers the data from a chosen SQL table and that data is converted to Avro-format FlowFiles. To be able to process these Avro FlowFiles, NiFi needs to convert them to JSON format FlowFiles, hence the need to use more processors. The files from MariaDB are the files used to measure latency in the tests, and in order to do this it is also necessary to have two ReplaceText processors in this flow. The processors (and controller service) which are used are:

QueryDatabaseTable is used to extract incremental data based on a column from the SQL table where the data is being gathered from. The processor gathers rows from the table, and puts each row in a separate FlowFile. The processor needs to be configured to know which database it should gather data from, what kind of database and which table. This is configured in the properties of the processor. Furthermore, the DBCPConnectionPool controller service needs to be configured for the processor to be able to work and access the database. This controller service is described below. The output of the QueryDatabaseTable processor is Avro format FlowFiles.

DBCPConnectionPool is the controller service used to obtain a connection to a specified database. It uses a separate MariaDB Connector, a driver used to connect Java applications with MariaDB. The controller service needs this driver, and it needs to be configured with the database connection URL (which contains the IP address, the port, and the name of the database), the driver class name from the MariaDB Connector, the location of the MariaDB Connector, and the login credentials for the database.

ConvertAvroToJSON is used to convert the Avro format FlowFiles into JSON format FlowFiles. The output JSON FlowFile is encoded with UTF-8 encoding. The JSON container options property is set to none.

MergeContent is configured as described in 3.5.1.

ReplaceText is used to add a timestamp to the FlowFiles. The timestamp is in the Unix timestamp format, expressed as milliseconds elapsed since midnight (UTC) on January 1, 1970. This timestamp is appended to the end of the FlowFile, and is used to check the latency when sending data through the pipeline.

PutHDFS is configured as described in 3.5.1.

ReplaceText is used a second time to add another timestamp in the FlowFiles. This timestamp represents the approximate time that a FlowFile gets sent to HDFS, since the FlowFiles it gets as input are the successful files from the PutHDFS processor.

PutHDFS is used a second time to store the FlowFiles after the last timestamp is added. These FlowFiles get sent to a different HDFS directory than the other FlowFiles, for easier extraction of timestamps. Other than this, the processor is configured the same way as the other PutHDFS processors.

3.5.4 Other Processors

GenerateFlowFile is a processor which generates the FlowFiles that are to be put to the remote file system. The size of these files is set to 1 kB. For the purposes of this experiment, there is no need for unique FlowFiles or custom text as the content of the FlowFiles.

PutSFTP takes the FlowFiles from GenerateFlowFile and puts them to the remote file system data source. As with the GetSFTP processor, the IP address of the remote node, the SSH port, and a username with an SSH key are needed to connect to the file system, along with the path in the file system where the files should be put. These are the files that will be fetched by GetSFTP.

GetHDFS is a processor which takes files from HDFS and puts them into the NiFi flow. In our flow, however, this processor is only used to empty the contents of HDFS between tests. To make this happen, the property "Keep Source File" is set to false, and "Ignore Dotted Files" is set to true, to make sure all files get removed. The processor is also set to automatically terminate all successfully retrieved files, since they are not needed in the NiFi flow. As in PutHDFS, the paths to core-site.xml and hdfs-site.xml are specified, as well as the directory which needs to be emptied.

For the data flow which uses Kafka as an intermediary between NiFi and HDFS (for the alternative setup), the processor PublishKafka_2_0 is used in the places where PutHDFS is used in figure 3.4.

PublishKafka_2_0 takes FlowFiles and turns them into messages which are published to a Kafka topic. As with the processor used to consume from a Kafka topic, PublishKafka_2_0 needs the IP address and port of the Kafka broker, along with the name of the topic to which the messages should be published. Since Kafka version 2.x is used, it is necessary to use the 2.0 version of the PublishKafka processor, and not one of the earlier versions.

4 Experimental Setup

4.1 Introduction

In the experimentation of this project, a data flow will be implemented in Apache NiFi and its performance will be measured in a number of different tests. The data flow will be implemented in a new setup and an alternative new setup, as described in part 3.4, and tested with the purpose of evaluating the setups in terms of latency and throughput. We will use fake data for the purpose of evaluating the performance of the NiFi setup; the contents of the data are not important, as long as the payload is fixed. NiFi will have the same three data sources in all the tests: an SQL database, a file system, and a Kafka topic. There will be three different tests. For the first two tests, the data will be delivered to the distributed file system data sink directly from NiFi by using the new setup (as described in part 3.4.1). In the third test, the new setup will be compared to the alternative new setup, in which data is sent from NiFi to the distributed file system via a Kafka topic (as described in part 3.4.2). These tests are done to answer the following questions:

• What is the FlowFile size, accomplished by merging together smaller FlowFiles, that gives the lowest latency for the new setup?

• Is it better to combine FlowFiles from different sources before sending them to the distributed file system, or is it better to keep them separate?

• Is there a difference in latency and throughput between the new NiFi setup and the alternative setup, which sends data from NiFi to a Kafka topic before the distributed file system?

The hypotheses we have based on these questions are:

• The best FlowFile size accomplished by merging in our new setup will be somewhere between doing no merging at all (resulting in 1 kB files) and merging the FlowFiles together to be 128 MB, which is the default block size for HDFS. We expect the value to be closer to 128 MB than 1 kB, since we believe the majority of the latency will come from waiting to be moved to HDFS.

• There will not be a big difference between doing separate and combined merging; the main difference will be that the output files look different, as they contain data from different sources and not just one.

• There will either not be any major difference in latency and throughput between the two setups, or the setup which uses Kafka before sending data to the data sink will be slightly slower, since there are more steps for the data to go through.

Part 4.2 describes the specific software used for the database and the distributed file system. Part 4.3 describes the three different compute nodes that are used to make up the data flow, and how these were configured. Lastly, part 4.4 explains the setup of the different tests that are performed, along with the metrics of the tests and how these are measured.

4.2 Additional Software Used

In addition to Apache Kafka and Apache NiFi, some other software is also used to build the setups. There is a need for a cloud-based file system as a data sink. For this, Hadoop Distributed File System (HDFS) was used. As an SQL database data source, MariaDB was used.

4.2.1 Apache Hadoop and HDFS

Apache Hadoop is a software library project which contains several different modules for distributed processing of large data sets. The Hadoop modules include, among others, Hadoop YARN for job scheduling and cluster resource management, and the Hadoop Distributed File System. The Hadoop Distributed File System is a distributed file system designed to store a smaller number of very large files (rather than many small files). HDFS runs on a cluster of commodity hardware, and is fault-tolerant through replication of data. HDFS uses a master-slave architecture, where a system has one NameNode as the master, and several DataNodes acting as slaves. Usually, there is one DataNode per participant in a cluster. The DataNodes perform read and write operations on the file system. [53, 54]

HDFS is designed to support very large files, and its performance will decrease if it needs to manage many smaller files. The default block size of HDFS as of version 2.0 is 128 MB. If a file is bigger than this, it will be split into 128 MB chunks in HDFS. A file smaller than the block size still occupies a block of its own, and every file and block is tracked as an object in the NameNode's memory, so many small files consume much more metadata memory than one large file of the same total size. This default value can be changed in the configuration file hdfs-site.xml.1 [55]
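The block arithmetic can be illustrated with a small sketch (`blocks_needed` is a hypothetical helper, not Hadoop code):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size since 2.0

def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks one file occupies (ceiling division)."""
    return -(-file_size_bytes // block_size)

# One 1 GB file vs. a million 1 kB files: roughly the same payload,
# but wildly different numbers of NameNode objects to track.
print(blocks_needed(1024**3))           # 8 blocks for one 1 GB file
print(1_000_000 * blocks_needed(1024))  # 1 000 000 blocks for 1M 1 kB files
```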

4.2.2 MariaDB

MariaDB is an open-source SQL database initially released in 2009, based on a fork of the MySQL database system. MariaDB is included in several Linux distributions, including Ubuntu, Debian, and Fedora. MariaDB includes features like encryption, authorization, authentication, and logging to provide security. It can also provide high availability through replication and clustering. [56, 57]

To set up MariaDB for the tests, a database with one table was created. Each row in the table contains a unique id attribute, a timestamp of the time the row was added, and 30 more attributes filled with arbitrary data. This way, each row contains roughly 1 000 characters, which is equivalent to approximately 1 kB of data. An SQL script (available in Appendix A.2) was used to add rows to the database table at the same time as NiFi pulls rows from the table, so that the timestamps are close to the time when the rows are pulled into NiFi.
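The actual script is in Appendix A.2; the following is only a hedged sketch of how a roughly 1 kB row and its INSERT statement could be generated. The table name, attribute count, and attribute length are assumptions:

```python
import random
import string

def make_row(row_id, n_attrs=30, attr_len=32):
    """Build one row: an id, a timestamp placeholder, and 30 attributes
    of arbitrary text, summing to roughly 1 000 characters (assumed)."""
    attrs = ["".join(random.choices(string.ascii_lowercase, k=attr_len))
             for _ in range(n_attrs)]
    return [row_id, "CURRENT_TIMESTAMP"] + attrs

def insert_statement(table, row):
    """Render a hypothetical INSERT matching the table layout above."""
    values = ", ".join("'" + str(v) + "'" for v in row)
    return "INSERT INTO " + table + " VALUES (" + values + ");"

stmt = insert_statement("sensor_data", make_row(1))
print(len(stmt))  # on the order of 1 kB
```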

1This was unfortunately noticed by us very late in the project, when we did not have time to perform further testing where the block size was changed.

4.3 Compute Nodes

During the experiment, three different compute nodes are used to host the different software and systems. Each node is implemented through a Virtual Machine (VM). A visualization of the nodes can be seen in figure 4.1. Download links to the software used can be found in Appendix A.3.

Figure 4.1: The three compute nodes used for the experiment.

4.3.1 Node 1

The first node is a VM which is accessed through SSH. The VM runs Ubuntu version 18.04 as its operating system (OS), with 8 GB of RAM and 4 CPU cores with a 2.5 GHz clock frequency. Java Development Kit (JDK) version 11.0.6 is installed on the VM for Kafka to be able to run. This node contains the three different data sources that send data to NiFi: a standard file system, a MariaDB SQL database version 10.4.12, and a Kafka broker and producer sending messages to a topic. The Kafka version used is 2.12-2.2.0.

4.3.2 Node 2

The second node is the same kind of VM as node 1, with Ubuntu 18.04 and JDK version 1.8.0. This node hosts NiFi version 1.11.4 in a terminal; however, to access NiFi's GUI it is necessary to use a web browser, which is not available in the VM. The GUI is therefore accessed by visiting [node 2 ip]:8080/nifi/ in a web browser of choice. NiFi gets data from the sources in node 1, and sends data to the sinks in node 3.

4.3.3 Node 3

The third node is the same kind of VM as nodes 1 and 2, with the same version of Ubuntu as node 1. The third node contains a Kafka version 2.12-2.2.0 broker and consumer, and Hadoop version 3.2.1, which contains the HDFS module. Due to requirements from Hadoop, the JDK version for node 3 is 1.8.0. Hadoop is set up in pseudo-distributed mode, meaning that it runs on a single node but each Hadoop daemon runs in a separate process, imitating a fully distributed mode. [58] Kafka is only used on this node for the alternative new setup used in the third test.

4.4 Experiment Description

For the evaluation of the new data pipeline setups in NiFi, three different tests were performed to compare and evaluate how these setups perform in terms of latency and throughput under different configurations.

4.4.1 Performance Metrics

The metrics that are measured and compared in the evaluation are latency and throughput. The latency measured in the experiments is defined as the difference in time between when a chunk of data is sent from its source and when it arrives at its destination. The latency is measured by comparing timestamps stored in the files that are sent from MariaDB. As a row gets added to the table in MariaDB, it gets a timestamp of the current time as one of its attributes. When the row gets turned into a FlowFile in NiFi, the timestamp becomes part of the contents of the FlowFile. Right before the MariaDB FlowFiles get sent to HDFS, or when a FlowFile arrives at Kafka, a second timestamp is added to the contents of the FlowFile. For the third test, these timestamps are compared to calculate the latency between creation and sending to HDFS. For the first and second tests, a third timestamp is added after the FlowFiles have been sent to HDFS, to get a more accurate time of when the files actually arrive at HDFS, and this is the timestamp used to calculate the total latency for these tests. For these tests, the difference between the first and second timestamps is also used to see how the latency is distributed over the data flow, i.e., how much time is spent between row creation in MariaDB and the point just before sending to HDFS, compared to the total latency.

In all the tests, a sample of 50 random but evenly distributed FlowFiles is collected and used to calculate an average latency and to show how the latency is distributed. The latency results represent the worst-case values for all measurements where FlowFiles are merged together. This means that the timestamp logged as the "starting time" is the first timestamp in a file, which represents the creation time of the FlowFile that had to wait in the merging queue the longest, i.e., the one that arrived at the queue first. It was deemed more interesting to look at these times than at the time for the FlowFile that arrived last and had to wait the shortest time in the queue. In the case where no merging is done, each file only contains one "starting time", so this is the time used as the first timestamp. All three VMs used were verified to be synchronized with regard to time.

The throughput is measured in bytes received in HDFS per second (bytes/second), which is calculated by dividing the number of received bytes by the number of seconds NiFi was running. The running time of NiFi is calculated by subtracting the first timestamp of the first file to reach HDFS from the final timestamp of the last file to reach HDFS.
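As a minimal sketch, the throughput formula reduces to a single division. The helper below is illustrative only and assumes the timestamps have already been converted to seconds.

```python
def throughput_bytes_per_second(total_bytes, first_timestamp_s, last_timestamp_s):
    """Bytes received in HDFS divided by how long NiFi was running,
    where the running time is the last file's final timestamp minus
    the first file's first timestamp (both given in seconds)."""
    return total_bytes / (last_timestamp_s - first_timestamp_s)

# Example (assumed numbers): 3 200 000 bytes over a 10-second run.
tp = throughput_bytes_per_second(3_200_000, 100.0, 110.0)
```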

4.4.2 Test Descriptions

For the evaluation of the data pipeline setups in NiFi, three tests will be performed. The purpose of the first test is to find the merging size of FlowFiles which results in the lowest latency in the new setup in NiFi. Files are pulled from the sources (file system, Kafka topic, and MariaDB table) and are then separately merged with the MergeContent processor, with different merging configurations. The merged sizes compared in this test are 10 kB, 50 kB, 100 kB, 1 000 kB, and 2 000 kB. For the MariaDB and file system sources, each file is initially 1 kB large, which means that 10 files are needed to get a 10 kB FlowFile, 50 files for a 50 kB FlowFile, etc. The default size for Kafka records is set to 100 B, since this is the approximate size of the Kafka records in the old setup. This means that 10 times more Kafka files need to be merged together than for the other sources. Additionally, one measurement is done with no merging of the FlowFiles at all.

Since not all files are exactly 1 kB (or 100 B for the Kafka records) large, the MergeContent processor's properties are configured to accept an interval of output FlowFile sizes. If this is not done, FlowFiles can get trapped in the queue before the MergeContent processor, since it might be impossible to create an output FlowFile of the exact required size. The intervals used are presented in table 4.1. These intervals were chosen to make sure that most of the FlowFiles in the data flow could pass through the MergeContent processor without having to spend unnecessary time in a queue.

Table 4.1: Intervals for achieving different merging sizes

Size (kB)    Interval (kB)
10           8-12
50           40-60
100          80-110
1 000        900-1 100
2 000        1 900-2 100

The NiFi flow for this test can be seen in figure 3.4 (full-size version in Appendix A.5).
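The need for an interval rather than an exact target size can be illustrated with a toy model. The greedy accumulation below is an assumption standing in for MergeContent's minimum/maximum group size bounds, not NiFi's actual binning algorithm.

```python
def can_merge(queue_sizes, min_size, max_size):
    """Greedily accumulate queued FlowFile sizes (in bytes) and report
    whether a merged output inside [min_size, max_size] can be formed.
    A toy model of MergeContent's minimum/maximum group size properties."""
    total = 0
    for size in queue_sizes:
        total += size
        if min_size <= total <= max_size:
            return True
        if total > max_size:
            return False
    return False

queue = [1_050] * 10  # ten roughly-1-kB FlowFiles, none exactly 1 000 B

exact = can_merge(queue, 10_000, 10_000)    # exact 10 kB target: files get trapped
interval = can_merge(queue, 8_000, 12_000)  # 8-12 kB interval: a group passes
```

With slightly oversized files, an exact 10 kB target can never be hit, which is why FlowFiles would otherwise wait in the queue indefinitely.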

Figure 4.2: NiFi data flow for the second test.

The second test also consists of merging several small FlowFiles into fewer and bigger FlowFiles. In this test, data from the three sources (MariaDB, Kafka topic, and file system) are combined and merged together in the same MergeContent processor, see figure 4.2 (a full-size version of the figure can be found in Appendix A.5). The purpose of this test is to compare the latency measured here to the latency measured in the first test, in which the sources are merged separately. For this test, the sizes of the combined merged FlowFiles are 10 kB, 50 kB, and 1 000 kB, with the same intervals of output FlowFile size set in the MergeContent processor as in the first test.

The third test compares the new NiFi setup (as described in section 3.4.1) with the alternative new setup (as described in section 3.4.2), to evaluate the differences in latency and throughput between the two. In this test, the new setup is tested first with only MariaDB as a source, with each row in the database table being 1 kB large. Kafka, which has now been changed to have a record size of 1 kB, is then added as another source of data packets. Lastly, the file system is added as a third 1 kB data source. In each part of the test, the FlowFiles are separately merged together in 10 kB batches, with the same interval as in the first test, to minimize queue times when sending data to HDFS. The same procedure is done with the alternative new setup, see the NiFi flow in figure 4.3 (a full-size version of the figure can be found in Appendix A.5). This flow has a PublishKafka_2_0 processor instead of the PutHDFS processor in the new setup, and the ReplaceText processors have also been removed. Instead of setting a timestamp in the FlowFiles as they leave NiFi to be sent to HDFS (which is what happens in the new setup), a second timestamp is added in node 3, after the files have gone through a Kafka topic and before they are sent to HDFS.

Figure 4.3: NiFi data flow for the alternative new setup used for the third test.

The latency measured in the third test is the time from row creation in MariaDB to when the file is ready to be sent to HDFS. This is because there was some trouble setting up a good and fast connection between Kafka and HDFS, and we did not want our suboptimal configuration to affect the results. In this test, the throughput is also measured and compared.

5 Results & Evaluation

5.1 Introduction

This section goes through the results of the experiments which were performed. The raw data that was used to make the graphs and plots is available in Appendix A.4. Section 5.2 presents and evaluates the results from test 1, section 5.3 goes over the results from test 2, and finally section 5.4 describes the results from test 3.

5.2 Results of Test 1

Figure 5.1: Average FlowFile latency for different merging sizes.

As seen in figure 5.1, there is a vast difference in latency between doing no merging (expressed as a 1 kB merging size in the graph) and doing any kind of merging. The average latency for sending files from NiFi to HDFS when merging is done lies between 45 and 112 milliseconds. In comparison, when the FlowFiles are not merged, the files need to queue for an average of 106 331 milliseconds, almost 2 minutes, to get through the PutHDFS processor. This large difference between doing no merging and merging into 10 kB FlowFiles was not expected. As can be seen in figure 5.1, the PutHDFS processor performs much better when putting 10 kB files to HDFS than 1 kB files, more than 10 times better. The reason for this is not fully clear, but one speculation is that it is a sign of HDFS's preference for larger files over small files.

It is only for the first measuring point that the latency between NiFi and HDFS (the difference between the red and blue lines in figure 5.1) is significant. For all of the merging measuring points, the main part of the latency occurs between the starting point in MariaDB and the point in NiFi just before the files are sent to HDFS; the actual sending of files to HDFS takes relatively little time. This latency distribution can be seen more clearly in figure 5.2, which shows the percentage of the total latency spent between the row creation in MariaDB and the point before sending files to HDFS. Figure 5.2 shows more clearly what can be seen in the difference between the two lines in figure 5.1: it is only for the no-merging measuring point that the majority of the latency for a FlowFile is spent waiting to be sent to HDFS. Again, this is because the files in that case have to wait for almost 2 minutes on average to get through the PutHDFS processor. However, just because the percentage of time spent between MariaDB and the PutHDFS queue is lower for no merging than for merging, it does not mean that less time is spent at this stage. As can be seen in figure 5.1, the actual time spent getting to the PutHDFS processor is larger for the 1 kB file size than for 10 kB. This is likely a symptom of NiFi having to handle more files when the files are not merged together.

Since the results for not merging FlowFiles are significantly different from, and so much worse than, the results from merging the files, they will be disregarded from now on. The focus will instead be on the differences between the merging sizes, starting with merging FlowFiles into 10 kB files. Based on the latency distribution seen in figure 5.2, and on similar calculations for the other tests, only the total latency will be displayed in the remaining graphs. The majority of the total latency occurs between row creation in MariaDB and arrival at the queue before the first PutHDFS processor in NiFi.

Figure 5.2: Percentage of the total average FlowFile latency made up by the time between MariaDB and NiFi, before being sent to HDFS.

Figure 5.3: Average FlowFile latency for different merging sizes.

Figure 5.4: Latency distribution for different merging sizes.

Figure 5.4 shows the spread of values behind the average total latency (from row creation in MariaDB to the FlowFile having been sent to HDFS) displayed in figure 5.3. The values shown in these graphs are based on the same data as in figure 5.1, but without the value for no merging. These graphs show how the latency for the files increases as more kilobytes are merged together before the files are sent to HDFS. This is because FlowFiles need to wait in a queue before the MergeContent processor until enough FlowFiles have arrived to be merged together. The larger the merged files are, the longer on average the files will have to wait in the queue before being merged, as more FlowFiles are needed to reach the merging size. So while it is faster to send few and large files to HDFS, it is slower within the NiFi flow to merge together large numbers of FlowFiles.

There is also a memory issue in NiFi when trying to merge together very many files. During the testing, an attempt was made to merge together 10 000 kB of FlowFiles, but this made NiFi crash, since too much data needed to be held in memory at the same time as the queues filled up. The FlowFile, Content, and Provenance Repositories and the log filled up with large amounts of data, and these needed to be emptied manually before it was possible to start NiFi again. This led to the conclusion that while it is better for HDFS to receive very large files (preferably up to 128 MB when this is set as the block size), it is not feasible to merge together as many 1 kB FlowFiles in NiFi as are needed to reach these sizes.

Based on the results of this test, 10 kB was chosen as the "default" merging value for the other tests, since it had the lowest average latency of the tested merging sizes. By doing this instead of having the default be no merging at all, the very high latency between NiFi and HDFS that occurs when no merging is done (as shown in figure 5.1) can be avoided.
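The linear growth of queue wait with merging size can be captured in a rough back-of-the-envelope model. The inter-arrival time below is an assumed figure for illustration, not a value measured in the thesis, and the model ignores processing and scheduling overheads.

```python
def expected_merge_wait_ms(merge_size_kb, flowfile_kb, inter_arrival_ms):
    """Rough model of the worst-case merge-queue wait: the first FlowFile
    to arrive waits for the remaining (n - 1) files, so its wait grows
    linearly with the merging size. Parameters are illustrative only."""
    n_files = merge_size_kb // flowfile_kb
    return (n_files - 1) * inter_arrival_ms

# With 1 kB FlowFiles arriving every 5 ms (assumed), going from a
# 10 kB to a 1 000 kB merging size multiplies the wait roughly 100-fold.
wait_10kb = expected_merge_wait_ms(10, 1, 5)
wait_1000kb = expected_merge_wait_ms(1_000, 1, 5)
```

Under this model, merging to 128 MB blocks with 1 kB inputs would require holding on the order of 130 000 FlowFiles, which is consistent with the memory problems observed at 10 000 kB.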

5.3 Results of Test 2

Figure 5.5: Average FlowFile latency for different merging sizes, comparing separate and combined merging.

Figure 5.6: Latency distribution for combined and separate merging.

The results of the second test are shown in figures 5.5 and 5.6. Figure 5.5 shows the difference in average latency between separate and combined merging, and how the average latency increases as the FlowFile merging size increases. The difference in average latency between separate and combined merging for a 10 kB merging size is about 1 000 ms, with separate merging being faster. For the other measuring points, combined merging is faster than separate merging. When the merging size is 1 000 kB, the difference in latency between the merging methods is about 5 000 ms, with combined merging being faster. This indicates that it is faster to use combined merging than separate merging for larger merging sizes, while for smaller merging sizes there is less of a difference.

The deviating value for combined merging at 10 kB can be partly explained by looking at the distribution of latency in figure 5.6, which shows the distribution of latency for both combined and separate merging. The spread of values for combined merging is much greater at 10 kB than for separate merging. Most of the higher latency values occurred in the first 30 seconds of the test. The reason for this is not clear, but it is plausible that in a real-life scenario, where the NiFi flow runs over long periods of time, the latency for combined merging would stabilize at a value close to the one observed with separate merging. The distributions for the other measuring points are less spread out, which makes it easier and safer to draw conclusions from these values. However, only for the 1 000 kB values is there a significant difference in latency between separate and combined merging (with combined merging being significantly faster in this case). The results of the tests for 50 kB and 10 kB have overlapping intervals, which means that those differences are not statistically significant.

The reason that combined merging is faster than separate merging in the case where the results are significantly different is that for combined merging, the queue before the MergeContent processor is shared between the three data sources. This means that the queue fills up to the merging size faster, and FlowFiles do not have to wait as long to be merged together. When the merging size is set to be large (1 000 kB in this test), the queue waiting time for a FlowFile is relatively long when the sources are merged separately; if the sources are merged together, this queue time can be decreased. This result also confirms what could be seen in the first test: the main part of the latency comes from waiting to be merged, and when this time can be decreased, the latency gets significantly lower.
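The shared-queue effect can be sketched with a simple rate model. The per-source arrival rate below is an assumption chosen for illustration, not a figure measured in the tests.

```python
def time_to_fill_queue_ms(merge_size_kb, flowfile_kb, rate_per_source_hz, n_sources):
    """Toy model: time until enough FlowFiles have queued to reach the
    merging size, when n_sources sources feed one MergeContent queue.
    All rates are illustrative assumptions."""
    files_needed = merge_size_kb / flowfile_kb
    arrivals_per_ms = n_sources * rate_per_source_hz / 1000.0
    return files_needed / arrivals_per_ms

# Each source emits 100 one-kB FlowFiles/s (assumed). For a 1 000 kB
# merging size, sharing the queue across three sources cuts the fill
# time to a third of the separate-merging case.
separate = time_to_fill_queue_ms(1_000, 1, 100, 1)
combined = time_to_fill_queue_ms(1_000, 1, 100, 3)
```

This three-fold reduction in queue-fill time is the mechanism behind the significantly lower combined-merging latency observed at 1 000 kB.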

5.4 Results of Test 3

Figure 5.7: Average throughput for the two different setups with different numbers of 1 kB sources.

Figure 5.7 shows the throughput results for the two setups used in the third test. The throughput is roughly the same for the two setups when using one and two sources, with a difference of less than 1 kB per second for both of these measuring points. The only major difference appears when all three sources are used, where the alternative setup (using Kafka before sending data to HDFS) has more than 100 kB per second higher throughput. The reason for this big difference for three sources is not known. Because of how similar the values for the two setups are for one and two data sources, it is reasonable to think that the throughput for three sources (when the file system is added as a source) should be around 300 kB/s for both setups.

When the third test was first performed, the throughput for the alternative setup was around 700 kB/s when using the same processor and file system configurations as when testing the new setup (apart from the PublishKafka_2_0 processor). Some tweaks were made to the content of the files pulled from the file system, since some special characters seemed to take up more space than 1 byte when being sent through Kafka. After this, the result shown in figure 5.7 was achieved. Since the expected value of about 320 kB/s could not be obtained for the alternative setup, the throughput of about 440 kB/s was decided to be close enough. The reason behind the deviating value is still unclear; it could depend on the implementation of the alternative new setup, or on some other factor that we do not know about. Since this value deviates so much from the rest of the values in the test, for no clear reason, it is not regarded as reliable to draw conclusions from.

Figure 5.8: Average FlowFile latency for the two different setups.

Figure 5.9: Latency distribution for the two different setups.

The average latencies for both setups are shown in figure 5.8, which shows the average latency for each respective setup and how it changes with each added source. When looking at the average latency, the results seem to increase somewhat for each added source for both the new and the alternative setup, and the new setup seems to have a slightly lower latency on average. However, when looking at the latency distribution in figure 5.9, it is clear that the differences in latency are not significant. This is likely due to the fact that the implementations of the two setups are very similar, with only one more step, through Kafka, needed in the alternative setup compared to the new setup.

One of the reasons why there was such a small difference in performance between the two setups might be that we did not implement a connection between Kafka and HDFS in the alternative new setup. In this setup, NiFi sends the FlowFiles to a Kafka topic, where they are gathered into a single file by a Python script (available in Appendix A.1). The file is then sent to HDFS through the terminal. If a connection between Kafka and HDFS had been implemented, the speed of this connection could have been tested against the one between NiFi and HDFS. This needs to be taken into consideration when comparing the performance of the two setups.
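The gather-and-put step can be sketched as below. The consumer loop itself is omitted (it needs a running broker), and the paths and record contents are hypothetical; only the pure gathering and command-building helpers are shown. The `hdfs dfs -put` command is the standard HDFS shell upload used from the terminal.

```python
# Sketch of gathering Kafka record values into one file and pushing it
# to HDFS via the terminal. Paths and record contents are hypothetical.

def gather_records(records):
    """Concatenate raw Kafka record values (bytes) into one
    newline-separated blob, mirroring how a consumer script could
    collect records into a single file."""
    return b"\n".join(records) + b"\n"

def hdfs_put_command(local_path, hdfs_dir):
    """Build the terminal command used to push the gathered file to HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]

blob = gather_records([b"row-1,ts1", b"row-2,ts2"])
cmd = hdfs_put_command("/tmp/gathered.txt", "/data/nifi/")
```

In practice the command list would be passed to subprocess.run on a node with the Hadoop client configured.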

5.5 Conclusion of Results

The results obtained through the various tests were largely as expected in our hypotheses (presented in section 4.1). The first test gave some slightly unexpected results: the difference between not merging FlowFiles in NiFi at all and merging them into 10 kB FlowFiles was not expected to be so large, as shown in figure 5.1. Other than this, it was expected that some merging would be better than no merging at all, and that large merging sizes would lead to long queues before merging in NiFi. The results showed that the lowest latency was found when the merging size was between 10 and 100 kB, and this result was used to set the default merging size to 10 kB for the following tests.

The results from the second test show that it is beneficial to combine different data sources before sending the data to HDFS for larger merging sizes. Though it might seem from the average latency shown in figure 5.5 that it is better to do separate merging for small merging sizes, figure 5.6 shows that the spread of values for combined merging at a 10 kB merging size is very large, and for merging sizes of both 10 kB and 50 kB the results overlap. This means that there is no significant difference in latency at the 10 kB or 50 kB merging sizes for this implementation. Since combined merging means that the data sources share a queue for filling up the merging batches, it is generally preferable, with regards to latency, to combine sources before merging, and doing this also gave significantly better results when the merging size was set to 1 000 kB.

The third and final test gave mostly expected results. We were not expecting any significant difference in latency or throughput between the two setups being compared, since they are so similar. The box plot in figure 5.9 shows that there is no significant difference in latency between the setups. Figure 5.7 makes it seem like the alternative setup has a much higher throughput when using three data sources, but this is thought to be due to implementation issues, and the result is not deemed reliable to draw conclusions from. Instead, the other measuring points lead to the conclusion that the throughput increases quite linearly with each added 1 kB data source, and that the difference between the setups is negligible. When it comes to simplicity of implementation, however, the new setup wins over the alternative setup. The alternative setup needs one more step than the new setup, and as has already been mentioned, it was hard to set up a good connection between Kafka and HDFS (something that was much easier to accomplish between NiFi and HDFS in the new setup).

6 Conclusions

6.1 Project Summary and Evaluation

The purpose of this project was to learn about data flows and data pipelines, and more specifically the tool Apache NiFi. A data pipeline setup was designed in NiFi, which connected an SQL database, a file system, and an Apache Kafka topic to a distributed file system. After successfully setting this data flow up, tests were performed to see how the system handled different workloads and where the best results were achieved. The three tests were to determine what size NiFi FlowFiles should be merged into, whether files from different data sources should be kept separate or merged together, and finally to compare the data flow setup with an alternative setup that has a Kafka topic as an intermediary between NiFi and the distributed file system. The results showed that merging FlowFiles into 10 kB files gave the lowest latency, and that merging together FlowFiles from all sources gives better latency than keeping them separate. Finally, it was shown that there was no significant difference between the NiFi flow that sent data directly to the distributed file system and the flow that first sent data to a Kafka topic.

A lot of time for this project, especially in the beginning phases, went into investigating and gathering information about Apache NiFi and the other tools used for the project. This was necessary in order for us to understand how all the tools and systems work before we started setting up our implementation. At times this information gathering was quite hard, since the available information often left much to be desired. Most of the tools only had a sparse or overly complicated documentation page on the product website, and perhaps a Wikipedia page. Luckily, NiFi itself has several official documentation pages, and there are also some YouTube videos from people involved with NiFi that contain good information.

The NiFi implementation phase of this project was relatively easy to handle.
NiFi does not require much configuration to work and handle data flows. The main problem in this phase was to make sure that NiFi could establish connections to the other nodes and software. This was done by configuring the properties of each processor in NiFi, which was a bit of a hurdle but still doable.

Our biggest implementation problem, however, was learning how to use and set up Apache Kafka and HDFS. A large part of our implementation work went into making sure that Kafka was working between nodes, and into setting up HDFS on one node. Having experienced ourselves how much easier NiFi is to set up than Kafka, we definitely prefer NiFi over Kafka. The main problem in setting up HDFS was finding a way to simulate a distributed file system on only one node; after some help from our mentor, however, this was solved. The other tools used for the setup, MariaDB and the (non-distributed) file system, were relatively easy to implement and set up. We had experience working with SQL databases before the project, so setting up the MariaDB database was quite straightforward. Setting up a directory for the file system in one of the VMs was also very straightforward. While setting up NiFi and MariaDB on their own was quite simple, setting up the connection between the two took some time: the documentation on how to use the controller service, which is needed to create this connection, is quite sparse and unclear at times.

The project has shown that it is fully possible to implement a data pipelining setup with Apache NiFi that gets data from several different sources and sends the data to an HDFS sink. Even though there have been plenty of bumps along the road, the final implementation ended up being quite easy to manage and test.

6.2 Future Work

One interesting scenario to test against the two new setups presented in this report would be a NiFi setup that completely cuts Kafka out of the equation. In this setup, data from a PLC would be piped directly into NiFi, without going through Kafka first. One way this could be done is by using the NiFi sub-project MiNiFi. This setup was suggested by Uddeholm AB, but it had to be cut due to time constraints.

Another setup that the task provider suggested was to test NiFi as an ETL (Extract, Transform, Load) tool. In this setup, NiFi would extract data from an SQL or NoSQL database, transform it in some way (for example, perform a join operation), and then load the data back into the table. This was decided to be slightly out of scope for this report, but it could still be interesting to explore.

In the setup that uses Kafka between NiFi and HDFS, a good connection between Kafka and HDFS was never established. Due to time limitations, it was decided to instead put data to HDFS manually (by terminal command) in the alternative setup. Setting up a good, fast connection between these two tools and comparing the latency between the data source and HDFS would give a better view of how the two setups compare.

One thing that would be very interesting to test is changing the block size in HDFS. This could be done by, for example, recreating the first test in this project with various block sizes. We unfortunately did not have time to test this, since we did not learn that changing the block size was a possibility until very late in the project.

Another expansion of the project would be to perform the same tests, but with more measuring points. For example, it would be interesting to see what the latency in the first test looks like at some points between no merging and a 10 kB merging size.

There are many variables, both in the setups we tested and in NiFi itself, that could be changed to possibly increase the performance of a data flow. For example, looking more closely at the different properties in the nifi.properties file could be interesting. Further, it would be interesting to set up the same data flow on several different platforms, such as Apache Kafka, Airflow, Storm, etc. Seeing how these tools compare in terms of latency, throughput, and simplicity of usage would be very useful for someone who is interested in setting up a data flow but unsure of which tool to use.

Along the same line, an expansion of the project could be to compare the results achieved with the new setups to those of the old setup. Since the task provider might change from the old setup to the new one, it would be of interest to see how they differ in terms of performance.

Lastly, another interesting project would be to explore NiFi's clustering functionality. We decided to work with only a single NiFi node, but it would be interesting to see how using the distributed mode of NiFi would affect the performance.

53 References

[1] M. Gidlund, S. Han, U. Jennehag, E. Sisinni, and A. Saifullah, “Industrial Internet of Things: Challenges, Opportunities, and Direc- tions,” IEEE Transactions on Industrial Informatics, vol. 14, no. 11, pp. 4724–4734, Nov. 2018. [Online]. Available: https://ieeexplore-ieee- org.bibproxy.kau.se/stamp/stamp.jsp?tp=&arnumber=8401919&tag=1

[2] Louis Columbus, “10 Charts That Will Challenge Your Perspec- tive Of IoT’s Growth,” 2018, [Accessed: 2020-05-12]. [Online]. Avail- able: https://www.forbes.com/sites/louiscolumbus/2018/06/06/10-charts-that-will- challenge-your-perspective-of-iots-growth/#6b67b71a3ecc

[3] Cisco, “Cisco Global Cloud Index: Forecast and Methodology, 2013–2018,” 2014, [Accessed: 2020-04-15]. [Online]. Available: https://www.terena.org/mail- archives/storage/pdfVVqL9tLHLH.pdf

[4] Various, “Pipeline (computing),” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://en.wikipedia.org/wiki/Pipeline (computing)

[5] A. M˘at˘acut, ˘a and C. Popa, “Big Data Analytics: Analysis of Features and Performance of Big Data Ingestion Tools,” Informatica Economic˘a, vol. 22, no. 2, pp. 25–34, 2018. [Online]. Available: http://revistaie.ase.ro/content/86/03%20- %20matacuta,%20popa.pdf

[6] J. L¨onnegren and S. Nystr¨om, “Processing data sources with big data frameworks,” 2016. [Online]. Available: http://www.diva- portal.org/smash/get/diva2:934359/FULLTEXT01.pdf

[7] Various, “Internet of Things,” 2020, [Accessed: 2020-04-14]. [Online]. Available: https://en.wikipedia.org/wiki/Internet of things

[8] ——, “Industrial Internet of Things,” 2020, [Accessed: 2020-04-15]. [Online]. Available: https://en.wikipedia.org/wiki/Industrial internet of things

[9] D. A. Patterson and J. L. Hennessy, Computer Organization and Design - The Hard- ware/Software Interface, 5th ed. Waltham, MA, US: Morgan Kaufmann, 2014.

[10] The Apache Software Foundation, “Apache Airflow,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://airflow.apache.org/

[11] ——, “Apache Spark,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://spark.apache.org/

54 [12] ——, “Apache Storm,” 2020, [Accessed: 2020-02-18]. [Online]. Available: http://storm.apache.org/index.html [13] Microsoft, “Data Factory,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://azure.microsoft.com/sv-se/services/data-factory/ [14] Elasticsearch B.V., “Logstash Introduction,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://www.elastic.co/logstash [15] Evan Parker, “What is a Data Pipeline,” 2019, [Accessed: 2020-05-06]. [Online]. Available: https://www.xplenty.com/blog/what-is-a-data-pipeline/ [16] Amazon Web Services, Inc., “What is Streaming Data?” 2020, [Accessed: 2020-04-23]. [Online]. Available: https://aws.amazon.com/streaming-data/ [17] Java Platform, “Introduction to Batch Processing,” 2017, [Accessed: 2020-04-16]. [Online]. Available: https://javaee.github.io/tutorial/batch-processing001.html [18] Various, “Apache Kafka,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://https://en.wikipedia.org/wiki/Apache Kafka [19] The Apache Software Foundation, “Apache Kafka - Introduction,” 2020, [Accessed: 2020-02-06]. [Online]. Available: https://kafka.apache.org/intro [20] K. M. M. Thein, “Apache Kafka: Next Generation Distributed Messag- ing System,” International Journal of Scientific Engineering and Technology Research, vol. 03, no. 47, pp. 9478–9483, Dec. 2014. [Online]. Available: http://ijsetr.com/uploads/436215IJSETR3636-621.pdf [21] Benjamin Reed, “ProjectDescription,” 2012, [Ac- cessed: 2020-02-10]. [Online]. Available: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription [22] Various, “Apache Nifi,” 2019, [Accessed: 2020-02-06]. [Online]. Available: https://https://en.wikipedia.org/wiki/Apache NiFi [23] Apache NiFi Team, “Apache NiFi Overview,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://nifi.apache.org/docs.html [24] Mastercard Incorporated, “Using NIFI to simplify data flow & streaming use cases @ Mastercard,” 2018, [Accessed: 2020-02-19]. [Online]. 
Available: https://www.youtube.com/watch?v=JjjjtgZIK6I [25] Comcast Corporation, “Data Ingest Self Service and Management us- ing Nifi and Kafta,” 2017, [Accessed: 2020-02-19]. [Online]. Available: https://www.youtube.com/watch?v=YGo7Ggvaguc

[26] Hashmap Incorporated, “Powered by Apache NiFi,” 2020, [Accessed: 2020-02-19]. [Online]. Available: http://nifi.apache.org/powered-by-nifi.html

[27] Groupe Renault, “Best practices and lessons learnt from Running Apache NiFi at Renault,” 2018, [Accessed: 2020-02-19]. [Online]. Available: https://www.youtube.com/watch?v=rF7FV8cCYIc

[28] Apache NiFi Team, “Apache NiFi In Depth,” 2020, [Accessed: 2020-02-25]. [Online]. Available: http://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html

[29] ——, “NiFi Developer’s Guide,” 2020, [Accessed: 2020-02-06]. [Online]. Available: http://nifi.apache.org/developer-guide.html

[30] The Apache Software Foundation, “Apache NiFi - Features,” 2020, [Accessed: 2020-02-03]. [Online]. Available: https://nifi.apache.org/index.html

[31] Apache NiFi Team, “NiFi System Administrator’s Guide,” 2020, [Accessed: 2020-02-12]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

[32] ——, “Apache NiFi Security Reference,” 2019, [Accessed: 2020-02-26]. [Online]. Available: https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.0/nifi-security/hdf-nifi-security.pdf

[33] Bryan Bende, “Integrating Apache NiFi and Apache Kafka,” 2016, [Accessed: 2020-02-19]. [Online]. Available: https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka

[34] DataWorks Summit, “Intelligently collecting data at the edge—intro to Apache MiNiFi,” 2018, [Accessed: 2020-03-07]. [Online]. Available: https://youtu.be/4m3Uuz3RpLg

[35] The Apache Software Foundation, “PublishKafka,” 2020, [Accessed: 2020-02-19]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_0/additionalDetails.html

[36] ——, “ConsumeKafka,” 2020, [Accessed: 2020-02-19]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0/additionalDetails.html

[37] ——, “Spark Streaming Programming Guide,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://spark.apache.org/docs/latest/streaming-programming-guide.html

[38] ——, “Apache Storm - Simple API,” 2019, [Accessed: 2020-04-07]. [Online]. Available: http://storm.apache.org/about/simple-api.html

[39] Microsoft, “Data Pipeline Pricing,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/

[40] Elasticsearch B.V., “Logstash,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://www.elastic.co/guide/en/logstash/current/introduction.html

[41] Various, “Programmable logic controller,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://en.wikipedia.org/wiki/Programmable_logic_controller

[42] The Apache Software Foundation, “PLC4x,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://plc4x.apache.org/

[43] iba AG, “iba system,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.iba-ag.com/en/iba-system/

[44] ——, “ibaDatCoordinator,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.iba-ag.com/en/ibadatcoordinator/

[45] Docker Inc., “Why Docker?” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.docker.com/why-docker

[46] ——, “The Industry-Leading Container Runtime,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.docker.com/products/container-runtime

[47] The Kubernetes Authors, “What is Kubernetes,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/

[48] Inductive Automation, “Ignition,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://inductiveautomation.com/ignition/

[49] Confluent Incorporated, “Kafka Security,” 2020, [Accessed: 2020-02-25]. [Online]. Available: https://docs.confluent.io/3.0.0/kafka/security.html

[50] The Apache Software Foundation, “Apache Avro,” 2020, [Accessed: 2020-02-25]. [Online]. Available: https://avro.apache.org/docs/current/

[51] ——, “Apache Kafka - Documentation,” 2020, [Accessed: 2020-02-26]. [Online]. Available: https://kafka.apache.org/documentation/#design

[52] Python Software Foundation, “kafka-python 2.0.1,” 2020, [Accessed: 2020-04-06]. [Online]. Available: https://pypi.org/project/kafka-python/

[53] Dataflair team, “HDFS Tutorial – A Complete Hadoop HDFS Overview,” 2020, [Accessed: 2020-03-12]. [Online]. Available: https://data-flair.training/blogs/hadoop-hdfs-tutorial/

[54] The Apache Software Foundation, “Apache Hadoop,” 2020, [Accessed: 2020-04-01]. [Online]. Available: http://hadoop.apache.org/

[55] Szele Balint, “The Small Files Problem,” 2009, [Accessed: 2020-04-08]. [Online]. Available: https://blog.cloudera.com/the-small-files-problem/

[56] Various, “MariaDB,” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://en.wikipedia.org/wiki/MariaDB

[57] MariaDB, “MariaDB Enterprise Server,” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://mariadb.com/docs/features/mariadb-enterprise-server/

[58] The Apache Software Foundation, “Hadoop: Setting up a single node cluster,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

A Appendix

All appendix content, apart from the pictures in Appendix A.5, is uploaded to a GitLab repository. A specific link is given for each appendix; the full repository is available at: https://git.cse.kau.se/linavilh100/appendixapachenifi.git

A.1 Python Script for Processing Kafka Messages

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/pythonscript

The Python script processes messages that originated in the MariaDB table, i.e., the files that contain a timestamp for when they were created. The other files were processed with a similar script; the only differences are that the topic name was topic3 instead of topic2, and that the output file was called secondfile.txt instead of file.txt. This separation made it easier to find the timestamped files when extracting the raw data.
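As a rough illustration of the script's structure (the actual code is in the repository linked above), the following sketch consumes records from topic2 with the kafka-python client [52] and appends each decoded message to file.txt. The broker address, UTF-8 decoding, and helper names here are assumptions for illustration, not taken from the appendix code.

```python
# Illustrative sketch only; the real script is in Appendix A.1.
# Assumes the kafka-python client [52] is installed. Broker address
# and decoding details are assumptions, not the thesis's actual code.

TOPIC = "topic2"        # the variant script uses "topic3"
OUTFILE = "file.txt"    # the variant script uses "secondfile.txt"


def decode_record(raw: bytes) -> str:
    """Turn a raw Kafka message value into one stripped text line."""
    return raw.decode("utf-8").strip()


def run(bootstrap: str = "localhost:9092") -> None:
    # Imported lazily so decode_record can be used (and tested)
    # without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # read the topic from the start
    )
    with open(OUTFILE, "a") as f:
        for record in consumer:        # blocks, polling the broker
            f.write(decode_record(record.value) + "\n")


if __name__ == "__main__":
    run()
```

Appending each message as its own line keeps the output file easy to scan for the creation timestamps when the raw data is extracted later.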

A.2 SQL Script for Loading Rows into MariaDB

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/sqlscript

A.3 Software Download Links

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/downloads

A.4 Raw Data

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/tree/master/data

A.5 Pictures

Figure A.1: Full-size version of Figure 3.4

Figure A.2: Full-size version of Figure 4.2

Figure A.3: Full-size version of Figure 4.3
