The Faculty of Health, Science and Technology Computer Science

Pontus Sjöberg, Lina Vilhelmsson

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi

Bachelor’s Project 2020:06

Implementation and Evaluation of a Data Pipeline for Industrial IoT Using Apache NiFi

Pontus Sjöberg, Lina Vilhelmsson

© 2020 The author(s) and Karlstad University

This report is submitted in partial fulfillment of the requirements for the Bachelor’s degree in Computer Science. All material in this report which is not my own work has been identified and no material is included for which a degree has previously been conferred.

Pontus Sjöberg

Lina Vilhelmsson

Approved, June 01, 2020

Advisor: Prof. Andreas Kassler

Examiner: Per Hurtig


Abstract

In the last few years, the popularity of Industrial IoT has grown considerably, and it is expected to have an impact of over 14 trillion USD on the global economy by 2030. One application of Industrial IoT is using data pipelining tools to move raw data from industrial machines to data storage, where the data can be processed by analytical instruments to help optimize the industrial operations. This thesis analyzes and evaluates a data pipeline setup for Industrial IoT built with the tool Apache NiFi. A data flow setup was designed in NiFi which connected an SQL database, a file system, and a Kafka topic to a distributed file system. To evaluate the NiFi data pipeline setup, three tests were conducted to see how the system performed under different workloads. The first test determined which size FlowFiles should be merged into to achieve the lowest latency; the second tested whether data from the different data sources should be kept separate or merged together. The third test compared the NiFi setup with an alternative setup, which had a Kafka topic as an intermediary between NiFi and the endpoint. The first test showed that the lowest latency was achieved when merging FlowFiles together into 10 kB files. In the second test, merging together FlowFiles from all three sources gave a lower latency than keeping them separate for larger merging sizes. Finally, it was shown that there was no significant difference between the two test setups.

Acknowledgements

We want to thank our mentor at Karlstad University, Andreas Kassler, for helping us with writing our report and guiding us through the project. We also want to thank Erik Hallin, our mentor at Uddeholm AB, for helping and guiding us with all the different tools we used throughout the project, and for giving us insight into how our implementation might be used in an industry context. Lastly, we want to thank Uddeholm AB for letting us do this project for them.

Contents

1 Introduction 1

2 Background 3
2.1 Introduction ...... 3
2.2 Concepts ...... 3
2.2.1 Industrial Internet of Things ...... 3
2.2.2 Data Pipelining ...... 4
2.2.3 Data Streaming ...... 5
2.3 Apache Kafka ...... 6
2.3.1 Topics ...... 6
2.3.2 Cluster ...... 6
2.3.3 Producers ...... 7
2.3.4 Consumers ...... 7
2.4 Apache NiFi ...... 8
2.4.1 Primary Components ...... 9
2.4.2 Extensions ...... 11
2.4.3 Security ...... 12
2.4.4 Cluster ...... 13
2.4.5 Compatibility ...... 14
2.5 NiFi as a Producer and Consumer for Kafka ...... 14
2.5.1 MiNiFi ...... 14
2.5.2 NiFi as a Producer ...... 15
2.5.3 NiFi as a Consumer ...... 16
2.6 Related Tools ...... 16
2.6.1 Apache Airflow ...... 16
2.6.2 Apache Spark ...... 17
2.6.3 Apache Storm ...... 17
2.6.4 Azure Data Factory ...... 18
2.6.5 Logstash ...... 18

3 Data Pipelining Architecture and Prototype 19
3.1 Introduction ...... 19
3.2 Current Setup ...... 19
3.3 Why Bring in NiFi? ...... 21
3.4 New Pipelining Setups ...... 23
3.4.1 New Setup ...... 23
3.4.2 Alternative New Setup ...... 24
3.5 NiFi Processors Used ...... 26
3.5.1 Consuming from Kafka Topic ...... 27
3.5.2 Getting Files from File System ...... 27
3.5.3 Getting Data from MariaDB ...... 28
3.5.4 Other Processors ...... 29

4 Experimental Setup 31
4.1 Introduction ...... 31
4.2 Additional Software Used ...... 32
4.2.1 Apache Hadoop and HDFS ...... 32
4.2.2 MariaDB ...... 33
4.3 Compute Nodes ...... 34
4.3.1 Node 1 ...... 34
4.3.2 Node 2 ...... 35
4.3.3 Node 3 ...... 35
4.4 Experiment Description ...... 35
4.4.1 Performance Metrics ...... 35
4.4.2 Test Descriptions ...... 37

5 Results & Evaluation 40
5.1 Introduction ...... 40
5.2 Results of Test 1 ...... 40
5.3 Results of Test 2 ...... 44
5.4 Results of Test 3 ...... 46
5.5 Conclusion of Results ...... 48

6 Conclusions 50
6.1 Project Summary and Evaluation ...... 50
6.2 Future Work ...... 51

References 54

A Appendix 59
A.1 Python Script for Processing Kafka Messages ...... 59
A.2 SQL Script for Loading Rows into MariaDB ...... 59
A.3 Software Download Links ...... 59
A.4 Raw Data ...... 59
A.5 Pictures ...... 60

List of Figures

2.1 A simplified view of the Kafka architecture ...... 7
2.2 NiFi’s GUI ...... 9
2.3 A simplified view of the NiFi architecture ...... 9
2.4 A simplified view of a NiFi Cluster ...... 14
3.1 The current setup at Uddeholm AB ...... 20
3.2 New setup with NiFi sending data directly to HDFS ...... 24
3.3 Alternative new setup with NiFi sending data to HDFS through Kafka ...... 25
3.4 The processors used in NiFi for the new setup ...... 26
4.1 The three compute nodes used for the experiment ...... 34
4.2 NiFi data flow for the second test ...... 38
4.3 NiFi data flow for alternative new setup used for the third test ...... 39
5.1 Average FlowFile latency for different merging sizes ...... 40
5.2 Percentage of the total average FlowFile latency made up by the time between MariaDB and NiFi, before being sent to HDFS ...... 42
5.3 Average FlowFile latency for different merging sizes ...... 42
5.4 Latency distribution for different merging sizes ...... 42
5.5 Average FlowFile latency for different merging sizes, comparing separate and combined merging ...... 44
5.6 Latency distribution for combined and separate merging ...... 44
5.7 Average throughput for the two different sources with different amounts of 1 kB sources ...... 46
5.8 Average FlowFile latency for the two different setups ...... 47
5.9 Latency distribution for the two different setups ...... 47
A.1 Full-size version of Figure 3.4 ...... 60
A.2 Full-size version of Figure 4.2 ...... 61
A.3 Full-size version of Figure 4.3 ...... 62

List of Tables

4.1 Intervals for achieving different merging sizes ...... 37


1 Introduction

The Industrial Internet of Things, or Industrial IoT, is a subset of the Internet of Things (IoT) specific to industrial use, and it covers the machine-to-machine and industrial communication parts of IoT. [1] Industrial IoT has grown a lot in the past few years, and it is expected by some to have an impact on the global economy of over 14 trillion USD by the year 2030. [2] Industrial IoT focuses on integrating and interconnecting already existing devices, whereas ”consumer” IoT (e.g. smart devices) focuses more on creating new devices. An example of an Industrial IoT application is collecting large amounts of data from industrial machines and sending this data to various analytical tools, which can then optimize the industrial operations based on how the machines are currently performing. [1, 3] One way this can be done is by the use of data pipelining tools, which are tools for moving data from one place to another. [4]

In this project, we will evaluate data pipelining and data flows in the context of Industrial IoT by creating data pipelining setups in the tool Apache NiFi, and try to find the best way to include NiFi in an architecture where data needs to be moved from several starting points into a cloud-based file system. To evaluate this setup, we will also test the performance of the data flow setup in NiFi under different configurations and workloads. Currently, there are very few scientific papers available that test the performance of a NiFi data flow. [5, 6] Therefore, the result of this project is interesting to the task provider Uddeholm AB, as they are looking into using NiFi as part of their data streaming architecture. The specific tests performed in this project are designed with Uddeholm AB in mind, to answer the questions they have about the performance of a NiFi data flow setup.

The disposition of the report is as follows: In Chapter 2, some background to the technologies, concepts, and tools used is given. The technologies and concepts described in this chapter are Industrial IoT, data pipelining, and data streaming. The tools Apache Kafka and Apache NiFi are also described in detail in this chapter, along with shorter descriptions of some alternative data pipelining tools. Chapter 3 describes the data pipelining prototype designed and implemented in the project. The ideas behind the design and why Apache NiFi is used are explained. The new setup is described both in a more abstract view and more specifically by looking at how the NiFi flow was set up. Chapter 4 presents the experiments that were performed to test the NiFi implementation. A more detailed description of the setup is given, explaining the exact software that was used, and the setups for the three tests that were performed are described. Chapter 5 presents the results of the tests described in Chapter 4, along with evaluations of the results. Finally, Chapter 6 concludes the report, and possible future work is proposed there as well.

2 Background

2.1 Introduction

This chapter gives a background to the technologies and the main tools used in the project. The chapter starts with part 2.2, which goes through some of the important concepts and technologies that are relevant to the project. Part 2.3 gives a description of Apache Kafka, and part 2.4 describes Apache NiFi. Part 2.5 looks at the possibilities of combining NiFi and Kafka. Part 2.6 introduces some other related tools.

2.2 Concepts

2.2.1 Industrial Internet of Things

The Industrial Internet of Things, or the Industrial IoT, is a subset of the Internet of Things, IoT, which is an umbrella term for systems of interconnected computing devices and machines which can transfer data to each other without human-to-human or human-to-computer interaction. The popularity of IoT has increased in the last few years, with the rise of ”smart” consumer devices such as smartphones, the Apple Watch, and Amazon Echo. [7] This category of devices, while often colloquially called simply ”IoT” devices, is only a part of all of IoT, and is classified in [1] as consumer IoT, to distinguish it from Industrial IoT. The Industrial IoT subset covers the machine-to-machine and industrial communication parts of IoT. Industrial IoT is about connecting the domains of operational technology with information technology (IT) domains. For example, large amounts of data collected from industrial machines can be connected to analytical instruments which can help optimize the industrial operations. While consumer IoT focuses on creating new devices, Industrial IoT focuses more on already existing devices, and how to integrate and interconnect these with each other. There is also a difference in the amount of generated data: consumer IoT data volumes are dependent on the application, whereas Industrial IoT data is meant for analytics, which leads to Industrial IoT generating very large amounts of data, up to several terabytes every minute. [1, 3]

The architecture of Industrial IoT systems can be visualized as being built up of four layers. The topmost layer is the content layer, which encompasses the user interface of the system, such as a screen. The next layer is the service layer, which holds applications that can perform operations on data which then can be displayed in the content layer. Next is the network layer, which consists of physical network buses and communication protocols, which aggregate data and then transport it to the service layer. The bottom layer is the device layer, which contains the physical (hardware) components of the system, such as sensors, machines, and cyber-physical systems (CPS). [8]

2.2.2 Data Pipelining

Data pipelining is an implementation technique where instructions are carried out in parallel instead of one after the other. In a system which does not use pipelining, each item has to go through all of the instructions in the system before the next item can enter the system. When using pipelining, each item still has to go through all the necessary steps and instructions, but as soon as the first item moves from the first to the second instruction, the next item in line can go through the first instruction. By using pipelining, the latency for each individual item going through the system stays the same; however, the throughput of the system is increased. [9] Another aspect of pipelining is that the output of one process (or instruction) is the input of the next process in the pipeline. Just like with a real-life pipe built up out of multiple different segments, the data pipeline brings data from the input through multiple processes which all feed into the next process in line, until the data finally reaches the output point. [4]

The latter aspect of pipelining, where the output of one process becomes the input of the next process, is the main function that a software pipelining tool provides. There are many tools on the market for data pipelining, all with different strengths and weaknesses. [10, 11, 12, 13, 14] This thesis will focus primarily on Apache NiFi and secondarily on Apache Kafka, but some alternative tools are described in part 2.6. With the use of data pipelining tools, raw data can be moved from one system to another. Usually, this means moving data from sensors, databases, etc., and putting the data into a data lake, the cloud, or some other data storage solution. Here, the data will be stored and analyzed. This analysis can be performed with machine learning algorithms, which can notice patterns and anomalies in the data. This is an example of how data pipelining tools can be used for Industrial IoT. [1, 15]
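As an illustration of this principle, the following Python sketch (a hypothetical example, not part of the implementation described later in this thesis) chains two generator stages so that the output of one becomes the input of the next, and each item flows onward as soon as it is produced:

```python
def parse(lines):
    # First pipeline stage: turn raw CSV lines into (sensor, value) pairs.
    for line in lines:
        sensor, value = line.split(",")
        yield sensor, float(value)

def threshold(records, limit):
    # Second pipeline stage: keep only readings above the limit.
    for sensor, value in records:
        if value > limit:
            yield sensor, value

raw = ["temp,21.5", "temp,99.1", "pressure,3.2"]
alerts = list(threshold(parse(raw), limit=50.0))
print(alerts)  # [('temp', 99.1)]
```

Because the stages are generators, an item can be consumed by the second stage before the first stage has seen the rest of the input, which is exactly the throughput benefit of pipelining described above.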

2.2.3 Data Streaming

Data streaming is the process of data being continuously generated by many different sources, where each data point usually is quite small (in the order of kilobytes). Streamed data can be processed with stream processing techniques, where each piece of data in the stream has operations done to it while the data is being streamed from one point to another. Streaming data is a good technique to use when the goal is to continuously analyze dynamic data and make decisions based on how the streamed data is behaving, for example with machine learning algorithms. [16] Stream processing can be compared to batch processing, which is a method where jobs or tasks are processed in batches, instead of one after the other as with stream processing. [17]

2.3 Apache Kafka

Apache Kafka is a distributed data streaming platform developed by LinkedIn in 2011. [18] Kafka uses publish-subscribe techniques to stream messages. In Kafka, the publishers are called producers, and the subscribers are known as consumers. The producers and consumers are connected through a server, in Kafka terms called a broker, which takes in the messages from the producers and sends them to the appropriate consumer(s). The messages are stored and organized in different topics within the broker. Producers send messages to the topics, and consumers read messages from the appropriate topic. [19, 20]

2.3.1 Topics

Topics are categories into which record streams are published in Kafka. Topics are multi-subscriber, which means that each topic can have zero, one, or more consumers that subscribe to the contents. Each topic has a partitioned log, which is an ordered sequence of records to which new records are appended. This partitioning allows a topic to have a log that is larger than what can fit on a single server. The records in each partition are given unique sequence IDs called offsets, to uniquely identify each record within a partition. All published records within a topic are stored for a specified amount of time, during which they can be read by consumers. This period of time is called the retention period. After this time is up, the records are discarded in order to free up space in the topic. [19]
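The offset mechanism can be illustrated with a small Python sketch (a toy model for illustration only; retention and replication are omitted):

```python
class Partition:
    """Toy model of a Kafka partition: an append-only log where each
    record receives a monotonically increasing offset."""
    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read_from(self, offset):
        # A consumer reads sequentially, starting at a given offset.
        return self._records[offset:]

partition = Partition()
print(partition.append("reading-1"))   # 0
print(partition.append("reading-2"))   # 1
print(partition.read_from(1))          # ['reading-2']
```

Because records are only ever appended, a consumer can resume from its last committed offset after a restart, which is the property the retention period relies on.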

2.3.2 Cluster

One or more brokers build up a Kafka cluster. The basic architecture of Kafka can be seen in figure 2.1. The brokers are managed by ZooKeeper, another Apache product, which also counts as part of the cluster. Apache ZooKeeper is a service for coordinating distributed systems like Apache Kafka, and it provides a shared hierarchical name space for coordination between distributed processes. [21]

Figure 2.1: A simplified view of the Kafka architecture

2.3.3 Producers

A producer in Kafka can publish data to whichever topic(s) they choose. The producer also chooses which record should be assigned to which partition within the chosen topic. This can be done in different ways, either using Round Robin to distribute the load evenly, or based on some semantic partitioning function (e.g. based on some key in the record). [19]
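These two partitioning strategies can be sketched in Python as follows (illustrative only; the function names are our own and not part of the Kafka API):

```python
from itertools import count

def round_robin_partitioner(num_partitions):
    # Spread records evenly across partitions, ignoring their content.
    counter = count()
    return lambda record: next(counter) % num_partitions

def key_partitioner(num_partitions):
    # Records with the same key always land in the same partition.
    return lambda record: hash(record["key"]) % num_partitions

rr = round_robin_partitioner(3)
print([rr({"key": k}) for k in "abcd"])  # [0, 1, 2, 0]

by_key = key_partitioner(3)
# The same key always maps to the same partition within a run:
assert by_key({"key": "machine-7"}) == by_key({"key": "machine-7"})
```

Key-based partitioning matters when per-source ordering must be preserved, since Kafka only guarantees ordering within a single partition.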

2.3.4 Consumers

A Kafka consumer can subscribe to one or more topics, and get the data that is published to these topics by the producers. Consumers are divided into consumer groups. Each record published to a topic is delivered to exactly one consumer instance within each subscribing consumer group; across groups, the record is broadcast, so every subscribing group receives its own copy. Commonly, the consumers are grouped together as ”logical subscribers”, i.e. all consumers which are subscribed to the same set of topics create a consumer group together. Each group will usually contain many consumer instances, which leads to higher fault tolerance. [19]
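This delivery semantics can be illustrated with a short Python sketch (a simplification for illustration; real Kafka assigns whole partitions, not individual records, to group members):

```python
from itertools import cycle

def deliver(record, groups):
    """Kafka-style fan-out: every consumer group receives the record,
    but only one member within each group handles it."""
    handled_by = {}
    for group_name, members in groups.items():
        handled_by[group_name] = next(members)  # round-robin within the group
    return handled_by

groups = {"analytics": cycle(["a1", "a2"]), "archiver": cycle(["b1"])}
print(deliver("rec-1", groups))  # {'analytics': 'a1', 'archiver': 'b1'}
print(deliver("rec-2", groups))  # {'analytics': 'a2', 'archiver': 'b1'}
```

Each group thus acts as one logical subscriber, while the work within a group is load-balanced across its instances.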

2.4 Apache NiFi

Apache NiFi is an open-source platform for managing data flows and data pipelines through a system. Apache NiFi started out as Niagarafiles, developed by the American National Security Agency (NSA) in 2006. NiFi was donated to the Apache Software Foundation (ASF) in 2014, and in 2015 it became a Top Level Project for ASF. The design of the NiFi platform is based on the flow-based programming (FBP) model. NiFi offers features that include the ability to run within clusters, security with TLS encryption, extensibility, and improved usability features. [22] NiFi was created with IoT in mind, and is often used in an Industrial IoT context. [23, 24, 25, 26, 27] NiFi’s core is a data flow that is built on the main concepts of FlowFiles, FlowFile Processors, and Connections. FlowFiles represent each object moving through the data flow. Each FlowFile has a UUID (Universally Unique Identifier), a file name, and a size, and can also have content, although this is not necessary. FlowFile Processors are where the actual work in NiFi is performed. The processors get access to FlowFiles, their attributes, and their content. Processors are the building blocks of NiFi data flows, and they are the most used component in NiFi. Processors can be used, for example, to create, delete, modify, or inspect FlowFiles before sending them to the next processor or to an endpoint outside NiFi. Processors can be grouped together into Process Groups, which are similar to subnets in the FBP model. Each process group is a set of processors and their connections which can receive and send data through input and output ports. By using process groups, creation of new components is done simply by combining other components. Connections provide linkage between processors, and they act as queues which in turn allow processors to interact at different rates. [23]

NiFi is run through a graphical user interface (GUI) which makes it easy to visualize the flow components (see figure 2.2), unlike some other similar tools (such as Apache Kafka) which only use the command line interface. The GUI uses drag-and-drop techniques for building up the data flow with processors and connections.

Figure 2.2: NiFi’s GUI

Figure 2.3: A simplified view of the NiFi architecture

2.4.1 Primary Components

NiFi is a Java program that runs on a Java Virtual Machine (JVM) on the hosting server. NiFi’s primary components on the JVM are the Web Server, the Flow Controller, Extensions (or extension points), the FlowFile Repository, the Content Repository, and the Provenance Repository (see figure 2.3). The Web Server hosts NiFi’s HTTP-based command and control API. The Flow Controller is the component that provides threads for extensions to run on.

It manages when extensions receive resources and when they execute. The Flow Controller acts as a broker between processors, facilitating the exchange of FlowFiles. The extensions will be described in further detail in part 2.4.2.

The FlowFile Repository is the component that allows NiFi to keep track of the state of what is known about the FlowFiles that are currently active in the data flow. It acts as a Write-Ahead Log of the metadata of each FlowFile that currently exists in the system. This metadata includes the attributes of a FlowFile, a pointer to the content of the FlowFile (in the Content Repository, described later in this section), and the state of the FlowFile. Each change to a FlowFile is logged in this repository before it is executed. With this information in the Write-Ahead Log, NiFi is able to handle restarts and unexpected system failures. NiFi resumes where it was stopped by checking the Write-Ahead Log along with the snapshot of the FlowFile in the Provenance Repository. The default approach of the FlowFile Repository is to have the Write-Ahead Log on a specified disk partition, but the repository itself is pluggable (alternative implementations can be swapped in).
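The recovery idea behind a Write-Ahead Log can be sketched in a few lines of Python (a toy model for illustration only; NiFi’s actual implementation is considerably more involved):

```python
class FlowFileRepository:
    """Toy write-ahead log: every attribute change is logged before it is
    applied, so state can be rebuilt after a crash by replaying the log."""
    def __init__(self):
        self.log = []    # durable, append-only record of intended changes
        self.state = {}  # in-memory FlowFile metadata

    def update(self, flowfile_id, attributes):
        self.log.append((flowfile_id, attributes))  # write to the log first...
        self.state[flowfile_id] = attributes        # ...then apply the change

    def recover(self):
        # After a restart, replay the log to reconstruct the last known state.
        recovered = {}
        for flowfile_id, attributes in self.log:
            recovered[flowfile_id] = attributes
        return recovered

repo = FlowFileRepository()
repo.update("ff-1", {"filename": "sensor.csv", "size": 1024})
assert repo.recover() == repo.state
```

Because the log entry is written before the in-memory state changes, a crash between the two steps loses nothing: replaying the log always reproduces a consistent view.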

The actual content of a FlowFile which is currently in the NiFi flow is located in the Content Repository. This content is stored locally on disk and is only read into JVM memory when needed. By doing this, the producer and consumer processors do not need to hold on to objects in memory. When the content of a FlowFile is no longer in use, the content is deleted or archived. Archiving of content can be enabled or disabled in the nifi.properties file. If the content is archived, it remains in the Content Repository until a maximum archiving time, also set in nifi.properties, has passed. If the Content Repository is taking up too much space, content that is archived or marked as no longer in use is deleted; this behavior is likewise configured in nifi.properties. The Content Repository is also pluggable, but by default it is implemented as a simple mechanism where blocks of data are stored in the file system. To help reduce contention on any single volume, more than one file storage location can be specified to get different physical partitions.

In the Provenance Repository, NiFi stores all the provenance event data (metadata). Provenance is a record of all that has happened to a FlowFile, i.e. the history of the FlowFile. A new provenance event is created every time an event occurs to a FlowFile. These provenance events give a snapshot of the FlowFile at that time. All of the attributes and pointers to the FlowFile’s content are copied and stored in the provenance event, along with the state of the FlowFile. The state can include information about the FlowFile’s relationship with other provenance events, among other things. The provenance events are stored in the Provenance Repository until they expire. The time until they expire is specified in the nifi.properties file. The repository is pluggable, but by default it is implemented in such a way that it uses one or more physical disk volumes. Event data is indexed and searchable within each location. [23, 28]

2.4.2 Extensions

As mentioned previously, NiFi has a number of extension points. These extension points give developers the ability to add features to the platform so that NiFi meets their needs. The most common extension points, along with the already described processors, are the ReportingTask, the ControllerService, the FlowFilePrioritizer, and the AuthorityProvider. [29]

The ReportingTask interface is the mechanism that allows NiFi to publish metrics, monitoring information and internal NiFi states to endpoints. Endpoints can be log files, e-mail, or remote web services, for example.

The ControllerService gives shared state and functionality to processors, other ControllerServices, and ReportingTasks within a single JVM. This could for example be used when loading a large data set into memory. By using a ControllerService, the data set can be loaded once and shared with all processors, instead of each processor loading the data set individually. The ControllerService is also used if there is a need to establish a connection to an external server.

The FlowFilePrioritizer interface provides prioritizing and sorting for FlowFiles that are placed in a queue so that they can be processed in the order that is most effective for that use case.

The AuthorityProvider is used for determining privileges and roles for a given user, if there are any.

2.4.3 Security

NiFi can ensure a secure connection by using SSL, SSH, HTTPS, and encryption of content. [30] NiFi uses two-way SSL in the data flows from system to system but also from user to system. In system-to-system cases NiFi uses two-way SSL in every exchange of the data flow by encrypting and decrypting through shared keys on each side for both the sender and the recipient. For user-to-system security, NiFi uses two-way SSL authentication to be able to authorize a user and control the correct level of access for the user (read-only, data flow manager, or admin). [23] NiFi also supports authenticating users with a login and password instead of the built-in SSL client certificate authentication. This is done by a ”Login Identity Provider”, which is a pluggable mechanism for authenticating users with login and password. Which provider to use is configured in the nifi.properties file. The authentication can currently be done through Apache Knox, Kerberos, OpenID Connect, or the Lightweight Directory Access Protocol (LDAP). [31] NiFi also provides Multi-Tenant Authorization, which means that the authority level of a data flow is applied to each component of that data flow. Once authenticated, users are split up into groups of users (tenants) that can command, control, or observe the data flow with different levels of authorization. If a user tries to view or modify a NiFi resource, the system checks whether the user has privileges to perform that action. The privileges are determined by policies, and these policies are managed by an authorizer. The authorizers are configured with two properties in the nifi.properties file: the nifi.authorizer.configuration.file property and the nifi.security.user.authorizer property.

The first property specifies the configuration file, and it is here that the authorizers are defined. By default the configuration is set so that the authorizers.xml file is selected. The authorizers.xml file is used to configure and define authorizers; the default authorizer is the StandardManagedAuthorizer. This authorizer contains a UserGroupProvider and an AccessPolicyProvider. These providers are used to load and optionally configure the users, groups, and access policies. The StandardManagedAuthorizer makes all of its access decisions based on these policies.

The second property indicates which of the authorizers configured in the authorizers.xml file should be used. [31, 32]
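As an example, the relevant part of nifi.properties could look as follows (the values shown are illustrative defaults, not taken from the setup described in this thesis):

```properties
# Which file defines the available authorizers (points at authorizers.xml by default)
nifi.authorizer.configuration.file=./conf/authorizers.xml
# Which of the configured authorizers NiFi should actually use
nifi.security.user.authorizer=managed-authorizer
```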

2.4.4 Cluster

Like Apache Kafka, NiFi can operate within a cluster, and NiFi does this by using a Zero-Master Clustering paradigm. Zero-Master Clustering entails that each node in the cluster performs the same tasks but on different sets of data. NiFi uses an embedded ZooKeeper, see figure 2.4, which chooses a node as the Cluster Coordinator. The Cluster Coordinator is in turn responsible for connecting and disconnecting nodes. Each node in the cluster reports its heartbeat and status to the Cluster Coordinator, and the Cluster Coordinator disconnects a node if it does not report any heartbeat status for a set amount of time. Each cluster also has a Primary Node, which is able to run isolated processes: while the rest of the nodes run all of the tasks in the data flow, the Primary Node can run a process in isolation. It is possible to configure in a processor whether it should execute on all nodes in the cluster or only on the Primary Node. Any potential fail-over is handled automatically by ZooKeeper. [29, 31]

Figure 2.4: A simplified view of a NiFi Cluster

2.4.5 Compatibility

Through its processors, NiFi is compatible with many other tools. NiFi can work against several different databases, both SQL (like InfluxDB and MariaDB) and NoSQL (MongoDB, Couchbase, DynamoDB, and HBase). NiFi can read from and write to both regular file systems (remotely through SSH as well as locally) and the distributed file system HDFS. [23] As described in part 2.5, NiFi can use Kafka as both a sink and source for data.

2.5 NiFi as a Producer and Consumer for Kafka

NiFi can work as both a producer and a consumer for Kafka by implementing the processors PublishKafka and ConsumeKafka respectively. The NiFi sub-project MiNiFi can be used as an alternative to just using PublishKafka. [33]

2.5.1 MiNiFi

MiNiFi is a sub-project of Apache NiFi. Whereas NiFi is implemented in a data center, MiNiFi is implemented at the ”edge”, i.e. close to where the data is created at sensors or an IoT implementation. MiNiFi does not have its own UI; instead, the flows are created in NiFi and then exported into MiNiFi. MiNiFi is much smaller than NiFi, and takes up less than 100 MB of space, whereas NiFi needs several GB of space. [34]

2.5.2 NiFi as a Producer

NiFi as a producer takes a FlowFile as input and forwards it to a topic at a Kafka broker. The main way to have NiFi as a Kafka producer is by using the PublishKafka processor. PublishKafka takes the contents of a FlowFile and puts them into a Kafka topic using the KafkaProducer API. The contents of the FlowFile are converted to a Kafka message. The PublishKafka processor includes an optional Message Demarcator. The demarcator is used to decide where the separation between messages should be, and in this way it determines whether the contents of a FlowFile should be sent as several messages or as one large message. One large message is the default approach if the Message Demarcator is not set. PublishKafka distributes the messages to a Kafka topic following the Round Robin principle between partitions, depending on the number of partitions. If some messages for a given FlowFile fail to send but some are successfully sent, the entire FlowFile is considered a failed FlowFile. The failed FlowFile will have a separate attribute set that indexes the last message successfully ACKed by Kafka. This attribute allows PublishKafka to re-send only the messages that have not been ACKed by Kafka. [33, 35]
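The demarcator’s effect on the publishing side can be summarized with a small Python sketch (illustrative only; the function name is ours and not a NiFi API):

```python
def flowfile_to_messages(content, demarcator=None):
    """Mimic PublishKafka's Message Demarcator: with a demarcator the
    FlowFile content is split into several Kafka messages; without one,
    the whole content becomes a single message."""
    if demarcator is None:
        return [content]
    return content.split(demarcator)

print(flowfile_to_messages(b"r1\nr2\nr3", demarcator=b"\n"))  # [b'r1', b'r2', b'r3']
print(flowfile_to_messages(b"r1\nr2\nr3"))                    # [b'r1\nr2\nr3']
```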

It is also possible to implement the PublishKafka processor in MiNiFi. This can be used to get data more directly from the source to Kafka, instead of having it go through the central NiFi instance.

Lastly, MiNiFi and NiFi can be combined, where MiNiFi delivers the data from the source, and NiFi then uses the PublishKafka processor to publish messages to a Kafka topic. [33]

2.5.3 NiFi as a Consumer

NiFi can also act as a Kafka consumer. In this case, the NiFi processor ConsumeKafka replaces the Kafka consumer and handles all the data from the chosen Kafka topic(s), delivering it to where it needs to go. No code needs to be written to implement this; the ConsumeKafka processor is simply dragged and dropped into the NiFi workflow. When Kafka sends a message to NiFi, the ConsumeKafka processor emits a FlowFile where the content of the FlowFile is the content of the Kafka message. The ConsumeKafka processor also includes an optional Message Demarcator. Unlike in the PublishKafka case, the demarcator in the ConsumeKafka processor indicates that all of the messages received in a single poll should be produced as one FlowFile, with the demarcator separating the messages from each other. If the demarcator property is left blank, ConsumeKafka will produce one FlowFile for each message received. [33, 36]
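The consuming side mirrors the publishing behavior, as the following sketch shows (again illustrative only, not NiFi code):

```python
def poll_to_flowfiles(messages, demarcator=None):
    """Mimic ConsumeKafka's Message Demarcator: with a demarcator, all
    messages from one poll are joined into a single FlowFile; without
    one, each message becomes its own FlowFile."""
    if demarcator is None:
        return list(messages)
    return [demarcator.join(messages)]

print(poll_to_flowfiles([b"m1", b"m2"], demarcator=b"\n"))  # [b'm1\nm2']
print(poll_to_flowfiles([b"m1", b"m2"]))                    # [b'm1', b'm2']
```

Joining many small messages into one FlowFile reduces the number of FlowFiles NiFi must track, which is closely related to the merging-size trade-off examined in the tests later in this thesis.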

2.6 Related Tools

There are several different tools, other than Apache Kafka, that serve functions similar to Apache NiFi's. This section describes some of these tools.

2.6.1 Apache Airflow

Apache Airflow is a platform for creating and managing workflows through Python scripts. Airflow displays the workflows as Directed Acyclic Graphs (DAGs) of tasks, which can be easily modified through command line utilities. Since Airflow uses Python as a programming language, libraries and classes can easily be imported for easier creation and management of workflows. Airflow also has a web UI which provides insight into the logs and status of tasks in the workflow. Airflow has many built-in integrations with tools such as Apache Hive, Spark, HDFS, and MySQL. Apache Kafka is not among these integrations; however, there are still ways to connect the two platforms with a bit of work. Overall, it is possible to do a lot with Airflow, but users are required to write the code for their workflows themselves. Airflow is not classified as a data streaming tool, since the tasks do not send data between each other. [10]

2.6.2 Apache Spark

Apache Spark is a cluster computing system, which provides high-level APIs in several different programming languages, and supports general execution graphs. Spark provides many libraries, which allow for work with SQL, machine learning (ML), streaming, and more. These different libraries can also easily be combined in Spark applications. The Spark API has an extension called Spark Streaming, which is specially created for scalable, high-throughput, and fault-tolerant stream processing of data streams. Spark Streaming can ingest data from Kafka, Twitter, TCP sockets, and more, and can push the data to file systems such as HDFS, databases, and dashboards. Spark Streaming has no UI, and programs are written in Scala, Java, or Python. [11, 37]

2.6.3 Apache Storm

Apache Storm is used for real-time processing of data streams. Storm uses vertices in the form of "spouts" and "bolts", and edges in the form of "streams" to visualize DAGs. Spouts are the sources of the streams, and will generally read from a queuing system such as Apache Kafka. It is also possible to configure a spout to create its own stream, or to read from something such as a Twitter streaming API. Bolts then process any number of input streams and produce output streams; most of the computation logic in Storm happens inside the bolts. The bolts can communicate with any database system. Storm can be used with any programming language, and it can reliably process unbounded streams of data. Use cases for Apache Storm include ETL (Extract, Transform, Load), real-time analytics, and online ML. [12, 38]

2.6.4 Azure Data Factory

Azure Data Factory is a Microsoft service designed to integrate different data sources. It is a platform for managing data in the cloud and on-premises, and is used to integrate data from storage systems into data-driven workflows for ETL and ELT (Extract, Load, Transform). Data Factory is mainly used to load data into Microsoft's own Azure SQL databases. Data Factory has no support for running user-written code, and it lacks the ability to add custom processes, so users are limited to the tools available within Data Factory. Unlike the other tools presented, no free version of Data Factory is available. The price of Data Factory depends on how many pipelines are orchestrated and executed, on data flow executions and debugging, and on the number of Data Factory operations used. [13, 39]

2.6.5 Logstash

Logstash is an open-source data processing pipeline from Elastic. Logstash ingests data from a source (such as Kafka, a file, or GitHub, to mention a few), transforms it in the way you want, and then transports the data (or events) to a "stash". There are many different stashes, for example sending the events to a TCP socket, publishing the data to a websocket, writing events to a Kafka topic, or writing the events to a file on disk. Logstash can be used as a pipelining tool, but it is mainly used for storing and managing logs and events. Logstash is run through the terminal and configurations are written in bash. It is categorized as a log management tool, whereas NiFi is categorized as a data streaming tool. [14, 40]

3 Data Pipelining Architecture and Prototype

3.1 Introduction

This chapter describes how we implemented a prototype for data pipelining in Apache NiFi. In part 3.2, a setup which does not use NiFi is described. Part 3.3 goes over the advantages of using NiFi in the pipeline, and part 3.4 shows the experimental setups which will be implemented in NiFi. Lastly, part 3.5 shows what our data pipeline looks like in NiFi, and describes the different processors used to build up the pipeline.

3.2 Current Setup

The goal for this project is to evaluate a new setup for data pipelining that can be used in an Industrial IoT context. In this part, the setup that is currently used at the task provider, Uddeholm AB, is described. This is a setup which only uses Kafka as a tool to transfer raw data from the source to various data sinks. This setup will from here on be referred to as the "old setup".

Data collected by sensors in the industrial machines is transferred using either only PLC4x or a combination of a traditional PLC and the data acquisition software iba (more specifically the tool ibaDatCoordinator). This data is sent to the internal Kafka cluster, which is containerized via Docker and orchestrated with Kubernetes. There are three Kafka consumers: a web server, a cloud service (AWS), and the production system. The SCADA system Ignition is used mainly for visualization. A model of the setup can be seen in figure 3.1.

Figure 3.1: The current setup at Uddeholm AB

PLC - A PLC, or programmable logic controller, is an industrial computer which has been adapted in order to be used for automation, for example in an assembly line or in robotic devices. [41]

PLC4x - PLC4x is an open-source tool from Apache. It is a universal protocol adapter for industrial IoT. PLC4x has many built-in integration possibilities, of which Apache NiFi and Apache Kafka are two examples. [42]

iba - iba is a system for acquiring and analysing process data. The tool ibaDatCoordinator is used for automatic processing and managing of measurements. ibaDatCoordinator provides automatic generation of fault and quality reports and integrated status monitoring, and it notifies when set thresholds have been met. [43, 44]

Docker - Docker is a product for containerizing apps. Via Docker, it is possible to download software in containers, and these are hosted on the Docker Engine. The Docker Engine supports all different types of applications, and it works on several different operating systems. [45, 46]

Kubernetes - Kubernetes is an open-source system that can be used to group containers into logical units to simplify management. Kubernetes can help manage the containers that run applications, and make sure that downtime is minimized. [47]

Ignition - Ignition is a platform for building and deploying industrial applications. The Ignition software can act as a hub for all systems on the plant floor, which allows for complete system integration. With Ignition, data from databases can be illustrated in the form of graphs and tables. [48]

3.3 Why Bring in NiFi?

The main difference between the old setup and the proposed new setup (described in part 3.4) is that NiFi is brought in as the main data flow manager, partly replacing Kafka. In this part, these two tools will be compared.

Both NiFi and Kafka are able to move data from one node to another; however, NiFi is classified as a data pipelining tool, whereas Kafka is classified as a distributed streaming platform. When comparing NiFi and Kafka, one of the most obvious differences is that NiFi uses a web-based GUI, without the need for the user to write any code if they do not want to. Kafka does not have a GUI, and the user needs to write their own code to set up consumers, producers, and topics. This makes NiFi much more user-friendly, and users who do not have a lot of programming experience can find it much more accessible than Kafka. Thanks to NiFi's GUI, it is also easy to oversee the workflow, something that is harder to do in a similar way in Kafka. Another difference is that Kafka is made for smaller messages, whereas NiFi is made for larger messages and streams. Performance-wise, it is faster for NiFi to create a single FlowFile out of one million messages and send that, instead of sending one million one-message FlowFiles. [33]

NiFi and Kafka are quite equal when it comes to the level of security they can provide. As described in section 2.4.3, NiFi has several ways to provide secure connections when sending data through the pipelines; one of these is two-way SSL authentication. Kafka can also provide security in several ways since the 0.9.0.0 update. These security measures include securing connections to brokers through SSL, SASL, or SASL/PLAIN authentication; encryption of data sent between brokers and clients, tools, or other brokers through SSL; and authorization of read/write operations done by clients. The Kafka cluster can also be protected by requiring users to connect to it over SSH with a password. These security measures are all optional, which is also the case in NiFi. [49] Another way to provide security in Kafka, and the way it is done in the old setup, is by using an Avro schema. The Avro schema can minimize and compress the data, and the data can then only be read by someone who has the schema. If the schema is placed at a different endpoint than the data, this provides encryption of the data. [50]

In Kafka, all nodes contain metadata about which servers are currently alive and where the partition leaders of a topic are. They can share this metadata with a producer as an answer to a request, and the producer in turn can send its data directly to the broker that is the leader for the partition. [51] NiFi stores all FlowFile metadata in the FlowFile Repository and all provenance event data in the Provenance Repository, as explained in section 2.4.1. It is possible to get metadata about sent messages in Kafka, but it is trickier and not as easy as in NiFi. In this aspect, NiFi has a clear advantage over Kafka.

3.4 New Pipelining Setups

A new setup was designed as an alternative to the old setup described in part 3.2. The idea behind this setup is to insert NiFi between Kafka and the sink for the data, and also to add a database and a regular file system as data sources. The method for moving data from machines via PLCs to Kafka stays the same in the new setup; however, for the test simulations, fake data will be generated in Kafka. Furthermore, the new design has only one data sink, instead of the three data sinks seen in figure 3.1. This data sink is a distributed file system, more specifically the Hadoop Distributed File System, HDFS. According to Erik Hallin at Uddeholm AB, the advantage of using HDFS is that it is more scalable than the current data sinks. An alternative setup was also designed, similar to the new setup but with Kafka inserted between NiFi and the distributed file system. However, the main focus of the testing in this project will be on the new setup, not the alternative new setup.

3.4.1 New Setup

The new setup uses NiFi alone for pipelining data from the data sources to the HDFS endpoint, as seen in figure 3.2. NiFi acts as a Kafka consumer for a Kafka topic and forwards Kafka messages, along with data from a database and files from a remote file system, to HDFS. The Kafka producer publishes messages to a topic by use of the kafka-producer-perf-test.sh script, which generates messages to a specified topic, with a specified number of records (messages), record size, and throughput, using a producer configuration file. This is used to simulate data coming in from PLCs and being published to a Kafka topic. To make sure that NiFi can find the Kafka broker, which is on a different machine, the variable advertised.listeners in the Kafka configuration file server.properties is set to PLAINTEXT://[kafka machine ip]:9092. Other than this, there is no need to configure any Kafka files in any special way; the default values work for the purposes of this experiment.

Figure 3.2: New setup with NiFi sending data directly to HDFS

For Hadoop, the file core-site.xml is edited to add a configuration of the property fs.defaultFS to be hdfs://[hdfs machine ip]:9000, and the file hdfs-site.xml is edited to configure the property dfs.replication to be 1 and the property dfs.permissions.enabled to be false. The latter is done to turn off permission checks, which enables NiFi to connect to HDFS. The files hdfs-site.xml and core-site.xml are copied and added to the machine which hosts NiFi, since these are needed to connect NiFi and HDFS.
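Expressed as property entries, the Hadoop edits above look roughly as follows (the IP placeholder is kept from the text; the surrounding file contents are omitted):

```xml
<!-- core-site.xml: default file system URI for HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://[hdfs machine ip]:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single replica, permission checks off -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
```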

3.4.2 Alternative New Setup

The alternative new setup is meant to test how well it works to send the data from NiFi to a second Kafka topic before sending it to the endpoint in HDFS, as seen in figure 3.3. This could be interesting if one already has a connection set up between Kafka and Hadoop. Kafka and HDFS are configured as in the main new setup, and the NiFi flow is built up the same way, apart from the final step where the data is sent to a Kafka topic instead of an HDFS directory.

Figure 3.3: Alternative new setup with NiFi sending data to HDFS through Kafka

To send the data from Kafka to HDFS, a Python script is used, which uses the kafka-python library to read messages from a topic. [52] The script then appends the contents of each Kafka message, along with a timestamp of the current time (expressed in the Unix timestamp format), to a local file. This local file then gets uploaded to HDFS. This method of sending one larger file instead of many small files is used to work around the long time (around 2 seconds) it takes to connect to HDFS for each file sent. The Python script can be found in Appendix A.1.
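The actual script is listed in Appendix A.1; the following is only a condensed sketch of the approach. The function names, the record format, and the use of the hdfs CLI for the upload are our assumptions, not taken from the appendix:

```python
import subprocess
import time

def tag_with_timestamp(message_bytes):
    """Append a Unix timestamp in milliseconds to one Kafka message,
    producing one line of the buffered local file (format assumed)."""
    ts = int(time.time() * 1000)
    return message_bytes.decode() + "," + str(ts) + "\n"

def drain_topic(topic, local_path, hdfs_dir, max_messages=1000):
    """Read messages from `topic`, buffer them in a single local file,
    then upload that one file to HDFS, so only one HDFS connection
    (~2 s) is paid instead of one per message. Requires the
    kafka-python package and the hdfs CLI on the PATH."""
    from kafka import KafkaConsumer  # assumed: kafka-python installed
    consumer = KafkaConsumer(topic,
                             bootstrap_servers="[kafka machine ip]:9092")
    with open(local_path, "w") as out:
        for i, msg in enumerate(consumer):
            out.write(tag_with_timestamp(msg.value))
            if i + 1 >= max_messages:
                break
    subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir],
                   check=True)
```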

3.5 NiFi Processors Used

The NiFi processors that are used to create the new setups are described here. For some of the tests the NiFi flow looks slightly different, but all processors used across the tests, except one (PublishKafka_2_0), can be seen in figure 3.4 (a full-size picture can be found in Appendix A.5). The queue sizes for the connections between the processors were increased to hold 500 000 FlowFiles, to try to make sure that the connections would not influence the results. If a queue fills up, back pressure is applied to the previous processor, which slows down the flow.

Figure 3.4: The processors used in NiFi for the new setup. In the green box are processors for consuming from Kafka topic, the yellow box is for getting files from the remote file system, the red box is for pulling rows from the database. The blue box is for clearing out HDFS, and the pink box is for putting files to the remote file system.

3.5.1 Consuming from Kafka Topic

ConsumeKafka_2_0 is used to consume messages from a Kafka topic. In this processor's properties, the IP address and port of the Kafka broker are set, along with the name of the topic to consume data from. Using specifically ConsumeKafka_2_0 is necessary for compatibility with the Kafka system, since it is of version 2.x. For earlier versions of Kafka, there are other versions of the ConsumeKafka processor.

MergeContent is used to be able to merge several FlowFiles together for different setups and configurations of the data flow. This processor is used for all three data sources to test merging of FlowFiles. The properties Minimum Number of Entries and Maximum Number of Entries are both set to 1 when no merging is to happen, along with a Minimum Group Size of 0 bytes and no Maximum Group Size set. These values are changed to allow for merging when this is tested. The Maximum and Minimum Group Sizes were mainly used to set size intervals for the merged FlowFiles.

PutHDFS is used for all three data sources. This processor needs the paths to the files core-site.xml and hdfs-site.xml, which have been copied to the machine on which NiFi is hosted. PutHDFS also needs the HDFS directory where the files should be put.

3.5.2 Getting Files from File System

GetSFTP is the processor which gets files from a remote file system hosted on a different machine than NiFi. This processor uses SFTP (SSH File Transfer Protocol) to connect to the remote file system. To do this, the processor needs the IP address of the remote node, the SSH port (22), and a username with a corresponding SSH key to be able to access the file system, which is in a defined directory. The run schedule of this processor is set to 0.01 seconds, meaning that GetSFTP will pull files from the file system 100 times per second. The property Max Selects is set to 5 000, which means that this is the maximum number of files to be pulled in a single connection.

MergeContent is configured as described in 3.5.1.

PutHDFS is configured as described in 3.5.1.

3.5.3 Getting Data from MariaDB

When gathering data from MariaDB to HDFS, NiFi needs to use more processors than for the other data sources, as seen in figure 3.4. This is because SQL data needs to be handled in another way than the other data sources. The first processor, QueryDatabaseTable, gathers the data from a chosen SQL table and that data is converted to Avro-format FlowFiles. To be able to process these Avro FlowFiles, NiFi needs to convert them to JSON format FlowFiles, hence the need to use more processors. The files from MariaDB are the files used to measure latency in the tests, and in order to do this it is also necessary to have two ReplaceText processors in this flow. The processors (and controller service) which are used are:

QueryDatabaseTable is used to extract incremental data based on a column from the SQL table where the data is being gathered from. The processor gathers rows from the table, and puts each row in a separate FlowFile. The processor needs to be configured to know which database it should gather data from, what kind of database and which table. This is configured in the properties of the processor. Furthermore, the DBCPConnectionPool controller service needs to be configured for the processor to be able to work and access the database. This controller service is described below. The output of the QueryDatabaseTable processor is Avro format FlowFiles.

DBCPConnectionPool is the controller service used to obtain a connection to a specified database. It uses a separate MariaDB Connector, a driver used to connect Java applications with MariaDB. The controller service needs this driver, and it needs to be configured with the database connection URL (which contains the IP address, the port, and the name of the database), the driver class name from the MariaDB Connector, the location of the MariaDB Connector, and the login credentials for the database.

ConvertAvroToJSON is used to convert the Avro format FlowFiles into JSON format FlowFiles. The output JSON FlowFile is encoded with UTF-8 encoding. The JSON container options property is set to none.

MergeContent is configured as described in 3.5.1.

ReplaceText is used to add a timestamp to the FlowFiles. The timestamp is in the Unix timestamp format, expressed as milliseconds elapsed since midnight (UTC) on January 1, 1970. This timestamp is appended to the end of the FlowFile, and is used to check the latency when sending data through the pipeline.

PutHDFS is configured as described in 3.5.1.

ReplaceText is used a second time to add another timestamp in the FlowFiles. This timestamp represents the approximate time that a FlowFile gets sent to HDFS, since the FlowFiles it gets as input are the successful files from the PutHDFS processor.

PutHDFS is used a second time to store the FlowFiles after the last timestamp is added. These FlowFiles get sent to a different HDFS directory than the other FlowFiles, for easier extraction of timestamps. Other than this, the processor is configured the same way as the other PutHDFS processors.

3.5.4 Other Processors

GenerateFlowFile is a processor which generates the FlowFiles that are to be put to the remote file system. The size of these files is set to 1 kB. For the purposes of this experiment, there is no need for unique FlowFiles or custom text as the content of the FlowFiles.

PutSFTP takes the FlowFiles from GenerateFlowFile and puts them to the remote file system data source. As with the GetSFTP processor, the IP address of the remote node, the SSH port, and a username with an SSH key are needed to connect to the file system, along with the path in the file system where the files should be put. These are the files that will be fetched by GetSFTP.

GetHDFS is a processor which takes files from HDFS and puts them into the NiFi flow. In our flow, however, this processor is only used to empty the contents of HDFS between tests. To make this happen, the property "Keep Source File" is set to false, and "Ignore Dotted Files" is set to true, to make sure all files get removed. The processor is also set to automatically terminate all successfully retrieved files, since they are not needed in the NiFi flow. As in PutHDFS, the paths to core-site.xml and hdfs-site.xml are specified, as well as the directory which needs to be emptied.

For the data flow which uses Kafka as an intermediary between NiFi and HDFS (for the alternative setup), the processor PublishKafka_2_0 is used in the places where PutHDFS is used in figure 3.4.

PublishKafka_2_0 takes FlowFiles and turns them into messages which are published to a Kafka topic. As with the processor used to consume from a Kafka topic, PublishKafka_2_0 needs the IP address and port of the Kafka broker, along with the name of the topic to which the messages should be published. Since Kafka version 2.x is used, it is necessary to use the 2.0 version of the PublishKafka processor, and not one of the earlier versions.

4 Experimental Setup

4.1 Introduction

In the experimentation of this project, a data flow will be implemented in Apache NiFi and its performance will be measured in a number of different tests. The data flow will be implemented in a new setup and an alternative new setup, as described in part 3.4, and tested with the purpose of evaluating the setups in terms of latency and throughput. We will use fake data for the purpose of evaluating the performance of the NiFi setup; the contents of the data are not important, as long as the payload is fixed. NiFi will have the same three data sources in all the tests: an SQL database, a file system, and a Kafka topic. There will be three different tests. For the first two tests, the data will be delivered to the distributed file system data sink directly from NiFi by using the new setup (as described in part 3.4.1). In the third test, the new setup will be compared to the alternative new setup, in which data is sent from NiFi to the distributed file system via a Kafka topic (as described in part 3.4.2). These tests are done to answer the following questions:

• What is the FlowFile size, accomplished by merging together smaller FlowFiles, that gives the lowest latency for the new setup?

• Is it better to combine FlowFiles from different sources before sending them to the distributed file system, or is it better to keep them separate?

• Is there a difference in latency and throughput between the new NiFi setup and the alternative setup, which sends data from NiFi to a Kafka topic before the distributed file system?

The hypotheses we have based on these questions are:

• The best FlowFile size accomplished by merging in our new setup will be somewhere between doing no merging at all (resulting in 1 kB files) and merging the FlowFiles together to be 128 MB, which is the default block size for HDFS. We expect the value to be closer to 128 MB than 1 kB, since we believe the majority of the latency will come from waiting to be moved to HDFS.

• There will not be a big difference between doing separate and combined merging; the main difference will be that the output files look different, as they contain data from different sources and not just one.

• There will either not be any major difference in latency and throughput between the two setups, or the setup which uses Kafka before sending data to the data sink will be slightly slower, since there are more steps for the data to go through.

Part 4.2 describes the specific software used for the database and the distributed file system. Part 4.3 describes the three different compute nodes that are used to make up the data flow, and how these were configured. Lastly, part 4.4 explains the setup of the different tests that are performed, along with the metrics of the tests and how these are measured.

4.2 Additional Software Used

In addition to Apache Kafka and Apache NiFi, some other software is also used to build the setups. There is a need for a cloud-based file system as a data sink. For this, Hadoop Distributed File System (HDFS) was used. As an SQL database data source, MariaDB was used.

4.2.1 Apache Hadoop and HDFS

Apache Hadoop is a software library project which contains several different modules for distributed processing of large data sets. The Hadoop modules include, among others, Hadoop YARN for job scheduling and cluster resource management, and the Hadoop Distributed File System. The Hadoop Distributed File System is a distributed file system designed to store a smaller number of very large files (rather than many small files). HDFS runs on a cluster of commodity hardware, and is fault-tolerant through replication of data. HDFS uses a master-slave architecture, where a system has one NameNode as the master, and several DataNodes acting as slaves. Usually, there is one DataNode per participant in a cluster. The DataNodes perform read and write operations on the file system. [53, 54]

HDFS is designed to support very large files, and its performance will decrease if it needs to manage many smaller files. The default block size of HDFS as of version 2.0 is 128 MB. If a file is bigger than this, it will be split into 128 MB chunks in HDFS. A file smaller than the block size still occupies a block of its own, and every file and block is tracked as an object in the NameNode's memory, so many small files consume much more metadata memory than one large file of the same total size. This default value can be changed in the configuration file hdfs-site.xml.1 [55]
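The block arithmetic can be illustrated with a small sketch (`blocks_needed` is a hypothetical helper, not Hadoop code):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size since 2.0

def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks one file occupies (ceiling division)."""
    return -(-file_size_bytes // block_size)

# One 1 GB file vs. a million 1 kB files: roughly the same payload,
# but wildly different numbers of NameNode objects to track.
print(blocks_needed(1024**3))           # 8 blocks for one 1 GB file
print(1_000_000 * blocks_needed(1024))  # 1 000 000 blocks for 1M 1 kB files
```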

4.2.2 MariaDB

MariaDB is an open-source SQL database initially released in 2009, based on a fork of the MySQL database system. MariaDB is included in several Linux distributions, including Ubuntu, Debian, and Fedora. MariaDB includes features like encryption, authorization, authentication, and logging to provide security. It can also provide high availability through replication and clustering. [56, 57]

To set up MariaDB for the tests, a database with one table was created. Each row in the table contains a unique id attribute, a timestamp of the time the row was added, and 30 more attributes filled with arbitrary data. This way, each row contains roughly 1 000 characters, which is equivalent to approximately 1 kB of data. An SQL script (available in Appendix A.2) was used to add rows to the database table at the same time as NiFi pulls rows from the table, so that the timestamps are close to the time when the rows are pulled into NiFi.
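The actual script is in Appendix A.2; the following is only a hedged sketch of how a roughly 1 kB row and its INSERT statement could be generated. The table name, attribute count, and attribute length are assumptions:

```python
import random
import string

def make_row(row_id, n_attrs=30, attr_len=32):
    """Build one row: an id, a timestamp placeholder, and 30 attributes
    of arbitrary text, summing to roughly 1 000 characters (assumed)."""
    attrs = ["".join(random.choices(string.ascii_lowercase, k=attr_len))
             for _ in range(n_attrs)]
    return [row_id, "CURRENT_TIMESTAMP"] + attrs

def insert_statement(table, row):
    """Render a hypothetical INSERT matching the table layout above."""
    values = ", ".join("'" + str(v) + "'" for v in row)
    return "INSERT INTO " + table + " VALUES (" + values + ");"

stmt = insert_statement("sensor_data", make_row(1))
print(len(stmt))  # on the order of 1 kB
```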

1This was unfortunately noticed by us very late in the project, when we did not have time to perform further testing where the block size was changed.

4.3 Compute Nodes

During the experiment, three different compute nodes are used to host the different software and systems. Each node is implemented through a Virtual Machine (VM). A visualization of the nodes can be seen in figure 4.1. Download links to the software used can be found in Appendix A.3.

Figure 4.1: The three compute nodes used for the experiment.

4.3.1 Node 1

The first node is a VM which is accessed through SSH. The VM runs Ubuntu version 18.04 as its operating system (OS), with 8 GB of RAM and 4 CPU cores with a 2.5 GHz clock frequency. Java Development Kit (JDK) version 11.0.6 is installed on the VM for Kafka to be able to run. This node contains the three different data sources that send data to NiFi: a standard file system, a MariaDB SQL database version 10.4.12, and a Kafka broker and producer sending messages to a topic. The Kafka version used is 2.12-2.2.0.

4.3.2 Node 2

The second node is the same kind of VM as node 1, with Ubuntu 18.04 and JDK version 1.8.0. This node hosts NiFi version 1.11.4 in a terminal; however, to access NiFi's GUI it is necessary to use a web browser, which is not available in the VM. The GUI is therefore accessed by visiting [node 2 ip]:8080/nifi/ in a web browser of choice. NiFi gets data from the sources in node 1, and sends data to the sinks in node 3.

4.3.3 Node 3

The third node is the same kind of VM as nodes 1 and 2, with the same version of Ubuntu as node 1. The third node contains a Kafka version 2.12-2.2.0 broker and consumer, and Hadoop version 3.2.1, which contains the HDFS module. Due to requirements from Hadoop, the JDK version for node 3 is 1.8.0. Hadoop is set up in pseudo-distributed mode, meaning that it runs on a single node but each Hadoop daemon runs in a separate process, imitating a fully distributed mode. [58] Kafka is only used on this node for the alternative new setup used in the third test.

4.4 Experiment Description

For the evaluation of the new data pipeline setups in NiFi, three different tests were performed to compare and evaluate how these setups perform in terms of latency and throughput under different configurations.

4.4.1 Performance Metrics

The metrics that are measured and compared in the evaluation are latency and throughput. The latency measured in the experiments is defined as the difference in time between when a chunk of data is sent from its source and when it arrives at its destination. The latency is measured by comparing timestamps stored in the files that are sent from MariaDB. As a row gets added to the table in MariaDB, it gets a timestamp of the current time as one of its attributes. When the row gets turned into a FlowFile in NiFi, the timestamp becomes part of the contents of the FlowFile. Right before the MariaDB FlowFiles get sent to HDFS, or when a FlowFile arrives at Kafka, a second timestamp is added to the contents of the FlowFile. For the third test, these timestamps are compared to calculate the latency between creation and sending to HDFS. For the first and second tests, a third timestamp is added after the FlowFiles have been sent to HDFS, to get a more accurate time of when the files actually arrive at HDFS, and this is the timestamp used to calculate the total latency for these tests. For these tests, the difference between the first and second timestamps is also used to see how the latency is distributed over the data flow, i.e., how much time is spent between row creation in MariaDB and the point just before sending to HDFS, compared to the total latency.

In all the tests, a sample of 50 random but evenly distributed FlowFiles is collected and used to calculate an average latency and to show how the latency is distributed. The latency results represent the worst-case values for all measurements where FlowFiles are merged together. This means that the timestamp logged as the "starting time" is the first timestamp in a file, which represents the creation time of the FlowFile that had to wait in the merging queue the longest, i.e., the one that arrived at the queue first. It was deemed more interesting to look at these times than at the time for the FlowFile that arrived last and had to wait the shortest time in the queue. In the case where no merging is done, each file only contains one "starting time", so this is the time used as the first timestamp. All three VMs used were verified to be synchronized with regard to time.

The throughput is measured in bytes received in HDFS per second (bytes/second), which is calculated by dividing the number of received bytes by the number of seconds NiFi was running. The running time of NiFi is calculated by subtracting the first timestamp of the first file to reach HDFS from the final timestamp of the last file to reach HDFS.
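As a minimal sketch, the throughput formula reduces to a single division. The helper below is illustrative only and assumes the timestamps have already been converted to seconds.

```python
def throughput_bytes_per_second(total_bytes, first_timestamp_s, last_timestamp_s):
    """Bytes received in HDFS divided by how long NiFi was running,
    where the running time is the last file's final timestamp minus
    the first file's first timestamp (both given in seconds)."""
    return total_bytes / (last_timestamp_s - first_timestamp_s)

# Example (assumed numbers): 3 200 000 bytes over a 10-second run.
tp = throughput_bytes_per_second(3_200_000, 100.0, 110.0)
```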

4.4.2 Test Descriptions

For the evaluation of the data pipeline setups in NiFi, three tests will be performed. The purpose of the first test is to find the merging size of FlowFiles which results in the lowest latency in the new setup in NiFi. Files are pulled from the sources (file system, Kafka topic, and MariaDB table) and are then separately merged with the MergeContent processor, with different merging configurations. The merged sizes compared in this test are 10 kB, 50 kB, 100 kB, 1 000 kB, and 2 000 kB. For the MariaDB and file system sources, each file is initially 1 kB large, which means that 10 files are needed to get a 10 kB FlowFile, 50 files for a 50 kB FlowFile, etc. The default size for Kafka records is set to 100 B, since this is the approximate size of the Kafka records in the old setup. This means that 10 times more Kafka files need to be merged together than for the other sources. Additionally, one measurement is done with no merging of the FlowFiles at all.

Since not all files are exactly 1 kB (or 100 B for the Kafka records) large, the MergeContent processor's properties are configured to accept an interval of output FlowFile sizes. If this is not done, FlowFiles can get trapped in the queue before the MergeContent processor, since it might be impossible to create an output FlowFile of the exact required size. The intervals used are presented in table 4.1. These intervals were chosen to make sure that most of the FlowFiles in the data flow could pass through the MergeContent processor without having to spend unnecessary time in a queue.

Table 4.1: Intervals for achieving different merging sizes

Size (kB)    Interval (kB)
10           8-12
50           40-60
100          80-110
1 000        900-1 100
2 000        1 900-2 100

The NiFi flow for this test can be seen in figure 3.4 (full-size version in Appendix A.5).
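The need for an interval rather than an exact target size can be illustrated with a toy model. The greedy accumulation below is an assumption standing in for MergeContent's minimum/maximum group size bounds, not NiFi's actual binning algorithm.

```python
def can_merge(queue_sizes, min_size, max_size):
    """Greedily accumulate queued FlowFile sizes (in bytes) and report
    whether a merged output inside [min_size, max_size] can be formed.
    A toy model of MergeContent's minimum/maximum group size properties."""
    total = 0
    for size in queue_sizes:
        total += size
        if min_size <= total <= max_size:
            return True
        if total > max_size:
            return False
    return False

queue = [1_050] * 10  # ten roughly-1-kB FlowFiles, none exactly 1 000 B

exact = can_merge(queue, 10_000, 10_000)    # exact 10 kB target: files get trapped
interval = can_merge(queue, 8_000, 12_000)  # 8-12 kB interval: a group passes
```

With slightly oversized files, an exact 10 kB target can never be hit, which is why FlowFiles would otherwise wait in the queue indefinitely.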

Figure 4.2: NiFi data flow for the second test.

The second test also consists of merging several small FlowFiles into fewer and bigger FlowFiles. In this test, data from the three sources (MariaDB, Kafka topic, and file system) are combined and merged together in the same MergeContent processor, see figure 4.2 (a full-size version of the figure can be found in Appendix A.5). The purpose of this test is to compare the latency measured here to the latency measured in the first test, in which the sources are merged separately. For this test, the sizes of the combined merged FlowFiles are 10 kB, 50 kB, and 1 000 kB, with the same intervals of output FlowFile size set in the MergeContent processor as in the first test.

The third test compares the new NiFi setup (as described in section 3.4.1) with the alternative new setup (as described in section 3.4.2), to evaluate the differences in latency and throughput between the two. In this test, the new setup is tested first with only MariaDB as a source, with each row in the database table being 1 kB large. Kafka, which has now been changed to have a record size of 1 kB, is then added as another source of data packets. Lastly, the file system is added as a third 1 kB data source. In each part of the test, the FlowFiles are separately merged together in 10 kB batches, with the same interval as in the first test, to minimize queue times when sending data to HDFS. The same procedure is done with the alternative new setup, see the NiFi flow in figure 4.3 (a full-size version of the figure can be found in Appendix A.5). This flow has a PublishKafka_2_0 processor instead of the PutHDFS processor in the new setup, and the ReplaceText processors have also been removed. Instead of setting a timestamp in the FlowFiles as they leave NiFi to be sent to HDFS (which is what happens in the new setup), a second timestamp is added in node 3, after the files have gone through a Kafka topic and before they are sent to HDFS.

Figure 4.3: NiFi data flow for the alternative new setup used for the third test.

The latency measured in the third test is the time from row creation in MariaDB to when the file is ready to be sent to HDFS. This is because there was some trouble setting up a good and fast connection between Kafka and HDFS, and we did not want our suboptimal configuration to affect the results. In this test, the throughput is also measured and compared.

5 Results & Evaluation

5.1 Introduction

This section goes through the results of the experiments which were performed. The raw data that was used to make the graphs and plots is available in Appendix A.4. Section 5.2 presents and evaluates the results from test 1, section 5.3 goes over the results from test 2, and finally section 5.4 describes the results from test 3.

5.2 Results of Test 1

Figure 5.1: Average FlowFile latency for different merging sizes.

As seen in figure 5.1, there is a vast difference in latency between doing no merging (expressed as a 1 kB merging size in the graph) and doing any kind of merging. The average latency for sending files from NiFi to HDFS when merging is done lies between 45 and 112 milliseconds. In comparison, when the FlowFiles are not merged, the files need to queue for an average of 106 331 milliseconds, almost 2 minutes, to get through the PutHDFS processor. This large difference between doing no merging and merging into 10 kB FlowFiles was not expected. As can be seen in figure 5.1, the PutHDFS processor performs much better when putting 10 kB files to HDFS than 1 kB files, more than 10 times better. The reason for this is not fully clear, but one speculation is that it is a sign of HDFS's preference for larger files over small files.

It is only for the first measuring point that the latency between NiFi and HDFS (the difference between the red and blue lines in figure 5.1) is significant. For all of the merging measuring points, the main part of the latency occurs between the starting point in MariaDB and the point in NiFi just before the files are sent to HDFS; the actual sending of files to HDFS takes relatively little time. This latency distribution can be seen more clearly in figure 5.2, which shows the percentage of the total latency spent between the row creation in MariaDB and the point before sending files to HDFS. Figure 5.2 shows more clearly what can be seen in the difference between the two lines in figure 5.1: it is only for the no-merging measuring point that the majority of the latency for a FlowFile is spent waiting to be sent to HDFS. Again, this is because the files in that case have to wait for almost 2 minutes on average to get through the PutHDFS processor. However, just because the percentage of time spent between MariaDB and the PutHDFS queue is lower for no merging than for merging, it does not mean that less time is spent at this stage. As can be seen in figure 5.1, the actual time spent getting to the PutHDFS processor is larger for the 1 kB file size than for 10 kB. This is likely a symptom of NiFi having to handle more files when the files are not merged together.

Since the results for not merging FlowFiles are significantly different from, and so much worse than, the results from merging the files, they will be disregarded from now on. The focus will instead be on the differences between the merging sizes, starting with merging FlowFiles into 10 kB files. Based on the latency distribution seen in figure 5.2, and on similar calculations for the other tests, only the total latency will be displayed in the remaining graphs. The majority of the total latency occurs between row creation in MariaDB and arrival at the queue before the first PutHDFS processor in NiFi.

Figure 5.2: Percentage of the total average FlowFile latency made up by the time between MariaDB and NiFi, before being sent to HDFS.

Figure 5.3: Average FlowFile latency for different merging sizes.

Figure 5.4: Latency distribution for different merging sizes.

Figure 5.4 shows the spread of values behind the average total latency (from row creation in MariaDB to the FlowFile having been sent to HDFS) displayed in figure 5.3. The values shown in these graphs are based on the same data as in figure 5.1, but without the value for no merging. These graphs show how the latency for the files increases as more kilobytes are merged together before the files are sent to HDFS. This is because FlowFiles need to wait in a queue before the MergeContent processor until enough FlowFiles have arrived to be merged together. The larger the merged files are, the longer on average the files will have to wait in the queue before being merged, as more FlowFiles are needed to reach the merging size. So while it is faster to send few and large files to HDFS, it is slower within the NiFi flow to merge together large numbers of FlowFiles.

There is also a memory issue in NiFi when trying to merge together very many files. During the testing, an attempt was made to merge together 10 000 kB of FlowFiles, but this made NiFi crash, since too much data needed to be held in memory at the same time as the queues filled up. The FlowFile, Content, and Provenance Repositories and the log filled up with large amounts of data, and these needed to be emptied manually before it was possible to start NiFi again. This led to the conclusion that while it is better for HDFS to receive very large files (preferably up to 128 MB when this is set as the block size), it is not feasible to merge together as many 1 kB FlowFiles in NiFi as are needed to reach these sizes.

Based on the results of this test, 10 kB was chosen as the "default" merging value for the other tests, since it had the lowest average latency of the tested merging sizes. By doing this instead of having the default be no merging at all, the very high latency between NiFi and HDFS that occurs when no merging is done (as shown in figure 5.1) can be avoided.
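The linear growth of queue wait with merging size can be captured in a rough back-of-the-envelope model. The inter-arrival time below is an assumed figure for illustration, not a value measured in the thesis, and the model ignores processing and scheduling overheads.

```python
def expected_merge_wait_ms(merge_size_kb, flowfile_kb, inter_arrival_ms):
    """Rough model of the worst-case merge-queue wait: the first FlowFile
    to arrive waits for the remaining (n - 1) files, so its wait grows
    linearly with the merging size. Parameters are illustrative only."""
    n_files = merge_size_kb // flowfile_kb
    return (n_files - 1) * inter_arrival_ms

# With 1 kB FlowFiles arriving every 5 ms (assumed), going from a
# 10 kB to a 1 000 kB merging size multiplies the wait roughly 100-fold.
wait_10kb = expected_merge_wait_ms(10, 1, 5)
wait_1000kb = expected_merge_wait_ms(1_000, 1, 5)
```

Under this model, merging to 128 MB blocks with 1 kB inputs would require holding on the order of 130 000 FlowFiles, which is consistent with the memory problems observed at 10 000 kB.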

5.3 Results of Test 2

Figure 5.5: Average FlowFile latency for different merging sizes, comparing separate and combined merging.

Figure 5.6: Latency distribution for combined and separate merging.

The results of the second test are shown in figures 5.5 and 5.6. Figure 5.5 shows the difference in average latency between separate and combined merging, and how the average latency increases as the FlowFile merging size increases. The difference in average latency between separate and combined merging for a 10 kB merging size is about 1 000 ms, with separate merging being faster. For the other measuring points, combined merging is faster than separate merging. When the merging size is 1 000 kB, the difference in latency between the merging methods is about 5 000 ms, with combined merging being faster. This indicates that it is faster to use combined merging than separate merging for larger merging sizes, while for smaller merging sizes there is less of a difference.

The deviating value for combined merging at 10 kB can be partly explained by looking at the distribution of latency in figure 5.6, which shows the distribution of latency for both combined and separate merging. The spread of values for combined merging is much greater at 10 kB than for separate merging. Most of the higher latency values occurred in the first 30 seconds of the test. The reason for this is not clear, but it is plausible that in a real-life scenario, where the NiFi flow runs over long periods of time, the latency for combined merging would stabilize at a value close to the one observed with separate merging. The distributions for the other measuring points are less spread out, which makes it easier and safer to draw conclusions from these values. However, only for the 1 000 kB values is there a significant difference in latency between separate and combined merging (with combined merging being significantly faster in this case). The results of the tests for 50 kB and 10 kB have overlapping intervals, which means that those differences are not statistically significant.

The reason that combined merging is faster than separate merging in the case where the results are significantly different is that for combined merging, the queue before the MergeContent processor is shared between the three data sources. This means that the queue fills up to the merging size faster, and FlowFiles do not have to wait as long to be merged together. When the merging size is set to be large (1 000 kB in this test), the queue waiting time for a FlowFile is relatively long when the sources are merged separately; if the sources are merged together, this queue time can be decreased. This result also confirms what could be seen in the first test: the main part of the latency comes from waiting to be merged, and when this time can be decreased, the latency gets significantly lower.
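The shared-queue effect can be sketched with a simple rate model. The per-source arrival rate below is an assumption chosen for illustration, not a figure measured in the tests.

```python
def time_to_fill_queue_ms(merge_size_kb, flowfile_kb, rate_per_source_hz, n_sources):
    """Toy model: time until enough FlowFiles have queued to reach the
    merging size, when n_sources sources feed one MergeContent queue.
    All rates are illustrative assumptions."""
    files_needed = merge_size_kb / flowfile_kb
    arrivals_per_ms = n_sources * rate_per_source_hz / 1000.0
    return files_needed / arrivals_per_ms

# Each source emits 100 one-kB FlowFiles/s (assumed). For a 1 000 kB
# merging size, sharing the queue across three sources cuts the fill
# time to a third of the separate-merging case.
separate = time_to_fill_queue_ms(1_000, 1, 100, 1)
combined = time_to_fill_queue_ms(1_000, 1, 100, 3)
```

This three-fold reduction in queue-fill time is the mechanism behind the significantly lower combined-merging latency observed at 1 000 kB.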

5.4 Results of Test 3

Figure 5.7: Average throughput for the two different setups with different numbers of 1 kB sources.

Figure 5.7 shows the throughput results for the two setups used in the third test. The throughput is roughly the same for the two setups when using one and two sources, with a difference of less than 1 kB per second for both of these measuring points. The only major difference appears when all three sources are used, where the alternative setup (using Kafka before sending data to HDFS) has more than 100 kB per second higher throughput. The reason for this big difference for three sources is not known. Because of how similar the values for the two setups are for one and two data sources, it is reasonable to think that the throughput for three sources (when the file system is added as a source) should be around 300 kB/s for both setups.

When the third test was first performed, the throughput for the alternative setup was around 700 kB/s when using the same processor and file system configurations as when testing the new setup (apart from the PublishKafka_2_0 processor). Some tweaks were made to the content of the files pulled from the file system, since some special characters seemed to take up more space than 1 byte when being sent through Kafka. After this, the result shown in figure 5.7 was achieved. Since the expected value of about 320 kB/s could not be obtained for the alternative setup, the throughput of about 440 kB/s was decided to be close enough. The reason behind the deviating value is still unclear; it could depend on the implementation of the alternative new setup, or on some other factor that we do not know about. Since this value deviates so much from the rest of the values in the test, for no clear reason, it is not regarded as reliable to draw conclusions from.

Figure 5.8: Average FlowFile latency for the two different setups.

Figure 5.9: Latency distribution for the two different setups.

The average latencies for both setups are shown in figure 5.8, which shows the average latency for each respective setup and how it changes with each added source. When looking at the average latency, the results seem to increase somewhat for each added source for both the new and the alternative setup, and the new setup seems to have a slightly lower latency on average. However, when looking at the latency distribution in figure 5.9, it is clear that the differences in latency are not significant. This is likely due to the fact that the implementations of the two setups are very similar, with only one more step, through Kafka, needed in the alternative setup compared to the new setup.

One of the reasons why there was such a small difference in performance between the two setups might be that we did not implement a connection between Kafka and HDFS in the alternative new setup. In this setup, NiFi sends the FlowFiles to a Kafka topic, where they are gathered into a single file by a Python script (available in Appendix A.1). The file is then sent to HDFS through the terminal. If a connection between Kafka and HDFS had been implemented, the speed of this connection could have been tested against the one between NiFi and HDFS. This needs to be taken into consideration when comparing the performance of the two setups.
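The gather-and-put step can be sketched as below. The consumer loop itself is omitted (it needs a running broker), and the paths and record contents are hypothetical; only the pure gathering and command-building helpers are shown. The `hdfs dfs -put` command is the standard HDFS shell upload used from the terminal.

```python
# Sketch of gathering Kafka record values into one file and pushing it
# to HDFS via the terminal. Paths and record contents are hypothetical.

def gather_records(records):
    """Concatenate raw Kafka record values (bytes) into one
    newline-separated blob, mirroring how a consumer script could
    collect records into a single file."""
    return b"\n".join(records) + b"\n"

def hdfs_put_command(local_path, hdfs_dir):
    """Build the terminal command used to push the gathered file to HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]

blob = gather_records([b"row-1,ts1", b"row-2,ts2"])
cmd = hdfs_put_command("/tmp/gathered.txt", "/data/nifi/")
```

In practice the command list would be passed to subprocess.run on a node with the Hadoop client configured.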

5.5 Conclusion of Results

The results obtained through the various tests were largely as expected in our hypotheses (presented in section 4.1). The first test gave some slightly unexpected results: the difference between not merging FlowFiles in NiFi at all and merging them into 10 kB FlowFiles was not expected to be so large, as shown in figure 5.1. Other than this, it was expected that some merging would be better than no merging at all, and that large merging sizes would lead to long queues before merging in NiFi. The results showed that the lowest latency was found when the merging size was between 10 and 100 kB, and this result was used to set the default merging size to 10 kB for the following tests.

The results from the second test show that it is beneficial to combine different data sources before sending the data to HDFS for larger merging sizes. Though it might seem from the average latency shown in figure 5.5 that it is better to do separate merging for small merging sizes, figure 5.6 shows that the spread of values for combined merging at a 10 kB merging size is very large, and for merging sizes of both 10 kB and 50 kB the results overlap. This means that there is no significant difference in latency at the 10 kB or 50 kB merging sizes for this implementation. Since combined merging means that the data sources share a queue for filling up the merging batches, it is generally preferable, with regards to latency, to combine sources before merging, and doing this also gave significantly better results when the merging size was set to 1 000 kB.

The third and final test gave mostly expected results. We were not expecting any significant difference in latency or throughput between the two setups being compared, since they are so similar. The box plot in figure 5.9 shows that there is no significant difference in latency between the setups. Figure 5.7 makes it seem like the alternative setup has a much higher throughput when using three data sources, but this is thought to be due to implementation issues, and the result is not deemed reliable to draw conclusions from. Instead, the other measuring points lead to the conclusion that the throughput increases quite linearly with each added 1 kB data source, and that the difference between the setups is negligible. When it comes to simplicity of implementation, however, the new setup wins over the alternative setup. The alternative setup needs one more step than the new setup, and as has already been mentioned, it was hard to set up a good connection between Kafka and HDFS (something that was much easier to accomplish between NiFi and HDFS in the new setup).

6 Conclusions

6.1 Project Summary and Evaluation

The purpose of this project was to learn about data flows and data pipelines, and more specifically the tool Apache NiFi. A data pipeline setup was designed in NiFi, which connected an SQL database, a file system, and an Apache Kafka topic to a distributed file system. After successfully setting this data flow up, tests were performed to see how the system handled different workloads and where the best results were achieved. The three tests were to determine what size NiFi FlowFiles should be merged into, whether files from different data sources should be kept separate or merged together, and finally to compare the data flow setup with an alternative setup that has a Kafka topic as an intermediary between NiFi and the distributed file system. The results showed that merging FlowFiles into 10 kB files gave the lowest latency, and that merging together FlowFiles from all sources gives better latency than keeping them separate. Finally, it was shown that there was no significant difference between the NiFi flow that sent data directly to the distributed file system and the flow that first sent data to a Kafka topic.

A lot of time for this project, especially in the beginning phases, went into investigating and gathering information about Apache NiFi and the other tools used for the project. This was necessary in order for us to understand how all the tools and systems work before we started setting up our implementation. At times this information gathering was quite hard, since the available information often left much to be desired. Most of the tools only had a sparse or overly complicated documentation page on the product website, and perhaps a Wikipedia page. Luckily, NiFi itself has several official documentation pages, and there are also some YouTube videos from people involved with NiFi that contain good information.

The NiFi implementation phase of this project was relatively easy to handle.
NiFi does not require much configuration to work and handle data flows. The main problem in this phase was to make sure that NiFi could establish connections to the other nodes and software. This was done by configuring the properties of each processor in NiFi, which was a bit of a hurdle but still doable.

Our biggest implementation problem, however, was learning how to use and set up Apache Kafka and HDFS. A large part of our implementation work went into making sure that Kafka was working between nodes, and into setting up HDFS on one node. Having experienced ourselves how much easier NiFi is to set up than Kafka, we definitely prefer NiFi over Kafka. The main problem in setting up HDFS was finding a way to simulate a distributed file system on only one node; after some help from our mentor, however, this was solved. The other tools used for the setup, MariaDB and the (non-distributed) file system, were relatively easy to implement and set up. We had experience working with SQL databases before the project, so setting up the MariaDB database was quite straightforward. Setting up a directory for the file system in one of the VMs was also very straightforward. While setting up NiFi and MariaDB on their own was quite simple, setting up the connection between the two took some time: the documentation on how to use the controller service, which is needed to create this connection, is quite sparse and unclear at times.

The project has shown that it is fully possible to implement a data pipelining setup with Apache NiFi that gets data from several different sources and sends the data to an HDFS sink. Even though there have been plenty of bumps along the road, the final implementation ended up being quite easy to manage and test.

6.2 Future Work

One interesting scenario to test against the two new setups presented in this report would be a NiFi setup that completely cuts Kafka out of the equation. In this setup, data from a PLC would be piped directly into NiFi, without going through Kafka first. One way this could be done is by using the NiFi sub-project MiNiFi. This setup was suggested by Uddeholm AB, but it had to be cut due to time constraints.

Another setup that the task provider suggested was to test NiFi as an ETL (Extract, Transform, Load) tool. In this setup, NiFi would extract data from an SQL or NoSQL database, transform it in some way (for example, perform a join operation), and then load the data back into the table. This was decided to be slightly out of scope for this report, but it could still be interesting to explore.

In the setup that uses Kafka between NiFi and HDFS, a good connection between Kafka and HDFS was never established. Due to time limitations, it was decided to instead put data to HDFS manually (by terminal command) in the alternative setup. Setting up a good, fast connection between these two tools and comparing the latency between the data source and HDFS would give a better view of how the two setups compare.

One thing that would be very interesting to test is changing the block size in HDFS. This could be done by, for example, recreating the first test in this project with various block sizes. We unfortunately did not have time to test this, since we did not learn that changing the block size was a possibility until very late in the project.

Another expansion of the project would be to perform the same tests, but with more measuring points. For example, it would be interesting to see what the latency in the first test looks like at some points between no merging and a 10 kB merging size.

There are many variables, both in the setups we tested and in NiFi itself, that could be changed to possibly increase the performance of a data flow. For example, looking more closely at the different properties in the nifi.properties file could be interesting. Further, it would be interesting to set up the same data flow on several different platforms, such as Apache Kafka, Airflow, Storm, etc. Seeing how these tools compare in terms of latency, throughput, and simplicity of usage would be very useful for someone who is interested in setting up a data flow but unsure of which tool to use.

Along the same line, an expansion of the project could be to compare the results achieved with the new setups to those of the old setup. Since the task provider might change from the old setup to the new one, it would be of interest to see how they differ in terms of performance.

Lastly, another interesting project would be to explore NiFi's clustering functionality. We decided to work with only a single NiFi node, but it would be interesting to see how using the distributed mode of NiFi would affect the performance.

53 References

[1] M. Gidlund, S. Han, U. Jennehag, E. Sisinni, and A. Saifullah, “Industrial Internet of Things: Challenges, Opportunities, and Direc- tions,” IEEE Transactions on Industrial Informatics, vol. 14, no. 11, pp. 4724–4734, Nov. 2018. [Online]. Available: https://ieeexplore-ieee- org.bibproxy.kau.se/stamp/stamp.jsp?tp=&arnumber=8401919&tag=1

[2] Louis Columbus, “10 Charts That Will Challenge Your Perspec- tive Of IoT’s Growth,” 2018, [Accessed: 2020-05-12]. [Online]. Avail- able: https://www.forbes.com/sites/louiscolumbus/2018/06/06/10-charts-that-will- challenge-your-perspective-of-iots-growth/#6b67b71a3ecc

[3] Cisco, “Cisco Global Cloud Index: Forecast and Methodology, 2013–2018,” 2014, [Accessed: 2020-04-15]. [Online]. Available: https://www.terena.org/mail- archives/storage/pdfVVqL9tLHLH.pdf

[4] Various, “Pipeline (computing),” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://en.wikipedia.org/wiki/Pipeline (computing)

[5] A. M˘at˘acut, ˘a and C. Popa, “Big Data Analytics: Analysis of Features and Performance of Big Data Ingestion Tools,” Informatica Economic˘a, vol. 22, no. 2, pp. 25–34, 2018. [Online]. Available: http://revistaie.ase.ro/content/86/03%20- %20matacuta,%20popa.pdf

[6] J. L¨onnegren and S. Nystr¨om, “Processing data sources with big data frameworks,” 2016. [Online]. Available: http://www.diva- portal.org/smash/get/diva2:934359/FULLTEXT01.pdf

[7] Various, “Internet of Things,” 2020, [Accessed: 2020-04-14]. [Online]. Available: https://en.wikipedia.org/wiki/Internet of things

[8] ——, “Industrial Internet of Things,” 2020, [Accessed: 2020-04-15]. [Online]. Available: https://en.wikipedia.org/wiki/Industrial internet of things

[9] D. A. Patterson and J. L. Hennessy, Computer Organization and Design - The Hard- ware/Software Interface, 5th ed. Waltham, MA, US: Morgan Kaufmann, 2014.

[10] The Apache Software Foundation, “Apache Airflow,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://airflow.apache.org/

[11] ——, “Apache Spark,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://spark.apache.org/

54 [12] ——, “Apache Storm,” 2020, [Accessed: 2020-02-18]. [Online]. Available: http://storm.apache.org/index.html [13] Microsoft, “Data Factory,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://azure.microsoft.com/sv-se/services/data-factory/ [14] Elasticsearch B.V., “Logstash Introduction,” 2020, [Accessed: 2020-02-18]. [Online]. Available: https://www.elastic.co/logstash [15] Evan Parker, “What is a Data Pipeline,” 2019, [Accessed: 2020-05-06]. [Online]. Available: https://www.xplenty.com/blog/what-is-a-data-pipeline/ [16] Amazon Web Services, Inc., “What is Streaming Data?” 2020, [Accessed: 2020-04-23]. [Online]. Available: https://aws.amazon.com/streaming-data/ [17] Java Platform, “Introduction to Batch Processing,” 2017, [Accessed: 2020-04-16]. [Online]. Available: https://javaee.github.io/tutorial/batch-processing001.html [18] Various, “Apache Kafka,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://https://en.wikipedia.org/wiki/Apache Kafka [19] The Apache Software Foundation, “Apache Kafka - Introduction,” 2020, [Accessed: 2020-02-06]. [Online]. Available: https://kafka.apache.org/intro [20] K. M. M. Thein, “Apache Kafka: Next Generation Distributed Messag- ing System,” International Journal of Scientific Engineering and Technology Research, vol. 03, no. 47, pp. 9478–9483, Dec. 2014. [Online]. Available: http://ijsetr.com/uploads/436215IJSETR3636-621.pdf [21] Benjamin Reed, “ProjectDescription,” 2012, [Ac- cessed: 2020-02-10]. [Online]. Available: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription [22] Various, “Apache Nifi,” 2019, [Accessed: 2020-02-06]. [Online]. Available: https://https://en.wikipedia.org/wiki/Apache NiFi [23] Apache NiFi Team, “Apache NiFi Overview,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://nifi.apache.org/docs.html [24] Mastercard Incorporated, “Using NIFI to simplify data flow & streaming use cases @ Mastercard,” 2018, [Accessed: 2020-02-19]. [Online]. 
Available: https://www.youtube.com/watch?v=JjjjtgZIK6I [25] Comcast Corporation, “Data Ingest Self Service and Management us- ing Nifi and Kafta,” 2017, [Accessed: 2020-02-19]. [Online]. Available: https://www.youtube.com/watch?v=YGo7Ggvaguc

[26] Hashmap Incorporated, “Powered by Apache NiFi,” 2020, [Accessed: 2020-02-19]. [Online]. Available: http://nifi.apache.org/powered-by-nifi.html

[27] Groupe Renault, “Best practices and lessons learnt from Running Apache NiFi at Renault,” 2018, [Accessed: 2020-02-19]. [Online]. Available: https://www.youtube.com/watch?v=rF7FV8cCYIc

[28] Apache NiFi Team, “Apache NiFi In Depth,” 2020, [Accessed: 2020-02-25]. [Online]. Available: http://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html

[29] ——, “NiFi Developer’s Guide,” 2020, [Accessed: 2020-02-06]. [Online]. Available: http://nifi.apache.org/developer-guide.html

[30] The Apache Software Foundation, “Apache NiFi - Features,” 2020, [Accessed: 2020-02-03]. [Online]. Available: https://nifi.apache.org/index.html

[31] Apache NiFi Team, “NiFi System Administrator’s Guide,” 2020, [Accessed: 2020-02-12]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

[32] ——, “Apache NiFi Security Reference,” 2019, [Accessed: 2020-02-26]. [Online]. Available: https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.0/nifi-security/hdf-nifi-security.pdf

[33] Bryan Bende, “Integrating Apache NiFi and Apache Kafka,” 2016, [Accessed: 2020-02-19]. [Online]. Available: https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka

[34] DataWorks Summit, “Intelligently collecting data at the edge—intro to Apache MiNiFi,” 2018, [Accessed: 2020-03-07]. [Online]. Available: https://youtu.be/4m3Uuz3RpLg

[35] The Apache Software Foundation, “PublishKafka,” 2020, [Accessed: 2020-02-19]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.PublishKafka_2_0/additionalDetails.html

[36] ——, “ConsumeKafka,” 2020, [Accessed: 2020-02-19]. [Online]. Available: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-2-0-nar/1.9.2/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0/additionalDetails.html

[37] ——, “Spark Streaming Programming Guide,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://spark.apache.org/docs/latest/streaming-programming-guide.html

[38] ——, “Apache Storm - Simple API,” 2019, [Accessed: 2020-04-07]. [Online]. Available: http://storm.apache.org/about/simple-api.html

[39] Microsoft, “Data Pipeline Pricing,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/

[40] Elasticsearch B.V., “Logstash,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://www.elastic.co/guide/en/logstash/current/introduction.html

[41] Various, “Programmable logic controller,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://en.wikipedia.org/wiki/Programmable_logic_controller

[42] The Apache Software Foundation, “PLC4x,” 2020, [Accessed: 2020-02-10]. [Online]. Available: https://plc4x.apache.org/

[43] iba AG, “iba system,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.iba-ag.com/en/iba-system/

[44] ——, “ibaDatCoordinator,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.iba-ag.com/en/ibadatcoordinator/

[45] Docker Inc., “Why Docker?” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.docker.com/why-docker

[46] ——, “The Industry-Leading Container Runtime,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://www.docker.com/products/container-runtime

[47] The Kubernetes Authors, “What is Kubernetes,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/

[48] Inductive Automation, “Ignition,” 2020, [Accessed: 2020-02-11]. [Online]. Available: https://inductiveautomation.com/ignition/

[49] Confluent Incorporated, “Kafka Security,” 2020, [Accessed: 2020-02-25]. [Online]. Available: https://docs.confluent.io/3.0.0/kafka/security.html

[50] The Apache Software Foundation, “Apache Avro,” 2020, [Accessed: 2020-02-25]. [Online]. Available: https://avro.apache.org/docs/current/

[51] ——, “Apache Kafka - Documentation,” 2020, [Accessed: 2020-02-26]. [Online]. Available: https://kafka.apache.org/documentation/#design

[52] Python Software Foundation, “kafka-python 2.0.1,” 2020, [Accessed: 2020-04-06]. [Online]. Available: https://pypi.org/project/kafka-python/

[53] Dataflair team, “HDFS Tutorial – A Complete Hadoop HDFS Overview,” 2020, [Accessed: 2020-03-12]. [Online]. Available: https://data-flair.training/blogs/hadoop-hdfs-tutorial/

[54] The Apache Software Foundation, “Apache Hadoop,” 2020, [Accessed: 2020-04-01]. [Online]. Available: http://hadoop.apache.org/

[55] Szele Balint, “The Small Files Problem,” 2009, [Accessed: 2020-04-08]. [Online]. Available: https://blog.cloudera.com/the-small-files-problem/

[56] Various, “MariaDB,” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://en.wikipedia.org/wiki/MariaDB

[57] MariaDB, “MariaDB Enterprise Server,” 2020, [Accessed: 2020-03-31]. [Online]. Available: https://mariadb.com/docs/features/mariadb-enterprise-server/

[58] The Apache Software Foundation, “Hadoop: Setting up a single node cluster,” 2020, [Accessed: 2020-04-07]. [Online]. Available: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

A Appendix

All appendix content, apart from the pictures in Appendix A.5, is uploaded to a GitLab repository. A specific link is given for each appendix; the full repository is available at: https://git.cse.kau.se/linavilh100/appendixapachenifi.git

A.1 Python Script for Processing Kafka Messages

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/pythonscript

The Python script processes messages that originated in the MariaDB table, i.e., the files that contain a timestamp for when they were created. The other files were processed with a similar script; the only differences are that the topic name was topic3 instead of topic2, and that the output file was called secondfile.txt instead of file.txt. This separation made it easier to find the timestamped files when extracting the raw data.
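As a rough illustration of the script's structure (the actual code is in the repository linked above), the following sketch consumes records from topic2 with the kafka-python client [52] and appends each decoded message to file.txt. The broker address, UTF-8 decoding, and helper names here are assumptions for illustration, not taken from the appendix code.

```python
# Illustrative sketch only; the real script is in Appendix A.1.
# Assumes the kafka-python client [52] is installed. Broker address
# and decoding details are assumptions, not the thesis's actual code.

TOPIC = "topic2"        # the variant script uses "topic3"
OUTFILE = "file.txt"    # the variant script uses "secondfile.txt"


def decode_record(raw: bytes) -> str:
    """Turn a raw Kafka message value into one stripped text line."""
    return raw.decode("utf-8").strip()


def run(bootstrap: str = "localhost:9092") -> None:
    # Imported lazily so decode_record can be used (and tested)
    # without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # read the topic from the start
    )
    with open(OUTFILE, "a") as f:
        for record in consumer:        # blocks, polling the broker
            f.write(decode_record(record.value) + "\n")


if __name__ == "__main__":
    run()
```

Appending each message as its own line keeps the output file easy to scan for the creation timestamps when the raw data is extracted later.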

A.2 SQL Script for Loading Rows into MariaDB

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/sqlscript

A.3 Software Download Links

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/blob/master/downloads

A.4 Raw Data

https://git.cse.kau.se/linavilh100/appendixapachenifi/-/tree/master/data

A.5 Pictures

Figure A.1: Full-size version of Figure 3.4

Figure A.2: Full-size version of Figure 4.2

Figure A.3: Full-size version of Figure 4.3
