IT16048 Degree Project 30 credits, June 2016

Handling Data Flows of Streaming Internet of Things Data

Yonatan Kebede Serbessa

Master Programme in Computer Science

Abstract

Streaming data in various formats is generated very fast, and these data need to be processed and analyzed before they become useless. Currently existing technology provides the tools to process these data and gain more meaningful information out of them. This thesis has two parts: theoretical and practical. The theoretical part investigates which tools are suitable for stream data flow processing and analysis. In doing so, it starts by studying one of the main streaming data sources that produce large volumes of data: the Internet of Things. Here, the technologies behind it, common use cases, challenges, and solutions are studied. This is followed by an overview of the selected tools, namely Apache NiFi, Spark Streaming, and Storm, studying their key features, main components, and architecture. After the tools are studied, 5 parameters are selected to review how each tool handles these parameters. This can be useful when considering choosing a certain tool given the parameters and the use case at hand. The second part of the thesis involves Twitter data analysis, which is done using Apache NiFi, one of the tools studied. The purpose is to show how NiFi can be used for processing data, from ingestion to finally sending it to storage systems, and how it communicates with external storage, search, and indexing systems.

Supervisor: Markus Nilsson
Subject reviewer: Matteo Magnani
Examiner: Edith Ngai

Acknowledgment

It is with great honor that I express my gratitude to the Swedish Institute for awarding me the Swedish Institute Study Scholarship for my Master's studies at Uppsala University, Uppsala, Sweden. I would also like to extend my gratitude to my supervisor Markus Nilsson for giving me the chance to work on this thesis at Granditude AB and for providing important feedback on this report, and to my reviewer Matteo Magnani from Uppsala University for reviewing my work and following my progress throughout. My gratitude also goes to the whole team at Granditude for being supportive and providing a good working environment. Last but not least, I would like to thank my family and friends for their prayers and support. Thank you!

Contents

1 Introduction
  1.1 Problem Formulation and Goal
  1.2 Scope and Method
  1.3 Structure of the report
  1.4 Literature Review

2 Internet of Things Overview
  2.1 Technologies in IoT
    2.1.1 Radio Frequency Identification (RFID)
    2.1.2 Wireless Sensor Network (WSN)
    2.1.3 TCP/IP (IPv4, IPv6)
    2.1.4 Visualization Component
  2.2 Application Areas
    2.2.1 Smart Home
    2.2.2 Wearable
    2.2.3 Smart City
    2.2.4 IoT in Agriculture - Smart Farming and Animals
    2.2.5 IoT in Health/Connected Health
  2.3 Challenges and Solutions
    2.3.1 Challenges
    2.3.2 Solutions

3 Overview of Tools
  3.1 Apache NiFi History and Overview
    3.1.1 NiFi Architecture
    3.1.2 Key Features
    3.1.3 NiFi UI components
    3.1.4 NiFi Elements
  3.2 Apache Spark Streaming
    3.2.1 Key Features
    3.2.2 Basic Concepts and Main Operations
    3.2.3 Architecture
  3.3 Apache Storm
    3.3.1 Overview
    3.3.2 Basic Concepts and Architecture
    3.3.3 Architecture
    3.3.4 Features

4 Review and Comparison of the Tools
  4.1 Review
    4.1.1 Apache NiFi
    4.1.2 Spark Streaming
    4.1.3 Apache Storm
  4.2 Differences and Similarities
  4.3 Discussion of the parameters
  4.4 How each tool handles the use case
  4.5 Summary

5 Practical analysis/Twitter Data Analysis
  5.1 Problem definition
  5.2 Setup
  5.3 Analysis
    5.3.1 Data Ingestion
    5.3.2 Data Processing
    5.3.3 Data Storage
    5.3.4 Data Indexing & Visualization
    5.3.5 Data Result & Discussion
    5.3.6 Data Analysis in Solr

6 Evaluation

7 Conclusion and Future Work
  7.1 Future Work

References

Appendix

Chapter 1

Introduction

The number of devices connected to the internet is increasing each year at an alarming rate. According to Cisco, 50 billion devices are expected to be connected to the internet by 2020, and most of these connections will come from Internet of Things (IoT) devices such as wearables, smart home appliances, connected cars, and many more [1][2]. These devices produce a large volume of data at a very fast rate, and it needs to be processed in real time to gain more insight from it. There are different kinds of tools: some are designed to process only one form of data, either static or real-time, while others are designed to process both. This thesis project mainly deals with the handling/processing of real-time data flows, after a thorough study of selected stream analytics tools has been made. The thesis project is done at Granditude AB [4]. Granditude AB provides advanced data analytics and big data solutions built on open source to satisfy the needs of its customers. The company mainly uses open source frameworks and projects in the Hadoop ecosystem.

1.1 Problem Formulation and Goal

There are different types of data sources, namely real-time and static data sources. The data produced by real-time sources is fast, continuous, very large, and structured or unstructured. The data from a static source is stored historical data, which is very large and is used for enriching the real-time data. Since real-time data is produced at a fast rate, it has to be processed at the rate it is produced, before it perishes; so one problem streaming data faces is that it may not be processed fast enough. The data coming from these two sources needs to be combined, processed, and analyzed to provide meaningful information, which in turn is vital for making better decisions. This is another problem area for stream data flow processing: when the data from the two sources is not combined, due to poor integration of the different sources (static and real-time) or of data coming from different mobile devices, the result is data that is not analyzed properly and not enriched with historical data, and hence a poor result. Another problem that makes the handling or processing of streaming data difficult is the inability to adapt to changing real-time conditions, for example when errors occur. There are many tools which mainly process stream data; but studying, understanding,

and using all these platforms as they come is not scalable and is not covered in this work. This project aims to process a flow of streaming data using one tool. To achieve this, an overview of selected tools in this area is first given, and then the tool to be used in the analysis is chosen after a review and discussion of the tools based on certain parameters and a use case. This thesis project generally tries to answer questions such as:

• What tools currently exist for data extraction, processing, and analysis? This involves studying some of the selected tools in this area: their architecture, key features, and components.

• Based on the study, which tool is good for a particular use case?

• Which tool best handles both static and real-time data produced for analysis?

• Which tool enables making changes in the flow easily?

The defined use case consists of both real-time and static data to be processed and analyzed. The real-time data is tweets from the Twitter API, and the static data is tweets initially stored in the NoSQL database HBase. The two data sources need to be combined and filtered based on given properties. Based on the filtered result, incorrect data will be logged to another file, while the correct data will be stored in HBase. Finally, some of the filtered data will be indexed into Solr, an enterprise search platform. In this process, we will see what happens to each input source before and after they are combined. What techniques are used to merge and filter, and what priority levels should be given to each source, are also some of the questions answered during this stage. The basis for separating the data into correct and incorrect is also defined.

1.2 Scope and Method

The project is mainly divided into two parts: a theoretical part and a practical/analysis part. In the theoretical part, IoT is studied, as it involves many devices that produce these large amounts of data at a fast rate. In addition, the challenges it faces, the solutions that should be taken, and common existing IoT use cases are covered. Next, an overview of selected tools/platforms is given, consisting of a study of their main components, features, and common use cases. Besides this, the tools are further reviewed by defining a use case and certain parameters and seeing how each of the tools handles the parameters defined. Finally, based on the discussion result, one tool is selected for the analysis part of the project. The tools are chosen based on the requirement that they should be data processing or streaming tools and within the Hadoop framework. Based on this requirement, the tools chosen are:

• Apache Spark Streaming, Version 1.6.0

• Apache Storm, Version 0.9.6

• Apache NiFi, Version 0.6.0, HDF 1.2.0

In the practical part, a particular use case is used to showcase how the analysis is done using one of the tools studied.

1.3 Structure of the report

Here the structure of the report is briefly outlined. Chapter 2 gives an overview of the Internet of Things, comprising the technologies that make up IoT and common use cases. The challenges and solutions of IoT are also discussed briefly. Chapter 3 deals with an overview of the selected tools (Apache NiFi, Apache Spark Streaming, Apache Storm). It discusses the key features of each tool, their architecture, and the different components/elements they have. Chapter 4 is a continuation of the previous chapter; it defines certain parameters and a use case to discuss the characteristics of the tools and see how each of them behaves. Finally, based on the discussion, one tool is selected for use in the practical part. In Chapter 5, the practical phase of the project is discussed. It uses the tool chosen in the previous step to perform Twitter data analysis. Chapter 6 discusses the evaluation of the tool with respect to performance. Finally, conclusions and future work are outlined in Chapter 7.

1.4 Literature Review

Many of the papers discuss the technologies involved, common use cases, the challenges IoT is facing, and solutions for them. For example, the technology giant Ericsson is engaged in the IoT Initiative (IoT-i), with the objective of increasing the benefits and possibilities of IoT and of identifying and proposing solutions to tackle the challenges, with a team comprising both industry and academia [1]. In [2] Miorandi et al. present a survey of technologies, applications, and research challenges for the IoT. The survey also suggests RFID as the basis for the IoT technology to spread widely. In [3] a Cisco white paper defines IoT as the Internet of Objects that changes everything, considering the different aspects of our lives that it impacts, such as education, communication, business, science, and government. Different IoT application areas are also discussed in the report "Unlocking the Potential of the Internet of Things" by the McKinsey Global Institute [4], which describes the broad range of potential applications that include homes, vehicles, humans, cities, and factories as settings. In [5] the white paper discusses how IoT is being used in health care to improve access to care, increase quality, and reduce the cost of care. Some of their products include "Massimo radical-7" for clinical care and the "Sonamba Daily Monitoring" solution for early intervention/prevention, which can be used as wearable devices. Weber approaches IoT from the perspective of an Internet-based global architecture and discusses its significant impact on the privacy and security of all stakeholders involved [6]. Spark Streaming uses Discretized Streams (DStreams), defined by Zaharia et al. in [7] as a stream programming model that is capable of integrating with batch systems and provides consistent and efficient fault recovery. Since Apache NiFi is a new framework/tool for data flow management and processing, papers studying its features, programming models, and so on could not be found readily. So the study of the tool is made mostly by referring to and studying its project page [8].

Chapter 2

Internet of Things Overview

The Internet of Things (IoT), as defined by the International Telecommunication Union (ITU) [9], is a global infrastructure for the information society enabling advanced services by interconnecting things based on existing and evolving interoperable Information and Communication Technology (ICT). The term was first coined by Kevin Ashton in 1999 at the MIT Auto-ID Labs [10]. As the name suggests, Internet of Things is a combination of two words: Internet and Things [11]. The Internet is a network of networks interconnecting millions of computers globally using a standard communication protocol, TCP/IP. A Thing is any physical or virtual thing that can be identified, distinguished, and given an address, as in [11]. Examples of Things include humans, cars, food, different machines, and electronic devices which can be sensed and connected [11]. Combined, Internet of Things refers to a technology that seamlessly interconnects these "Things" using existing and evolving communication technologies and standards, anywhere and anytime, and that is capable of exchanging information, data, and resources between them. The Internet of Things aims at making these things smarter, in a way that lets them obtain information with little or no human intervention. It thereby allows communication Human-to-Human (H2H), Human-to-Things (H2T), and Things-to-Things (T2T), providing a unique identity to each and every object, as described in [11]. In the subsequent subsections, the technologies used in IoT, common use cases, the challenges IoT is currently facing, and their solutions are discussed.

2.1 Technologies in IoT

Different kinds of technologies are used in IoT applications. Basically, they can be categorized as hardware, middleware, and a presentation component [12]. The hardware components include things such as embedded sensors, while the middleware consists of application tools for analysis. The presentation component is about how the analyzed data is presented to the end user, i.e., visualization on different platforms. Below are some of the main technologies behind IoT implementations.

2.1.1 Radio Frequency Identification (RFID)

RFID is a wireless microchip technology that enables "Things" to be uniquely identified. It was first developed at the Auto-ID lab at MIT in 1999 [10]. It is an easy, reliable, efficient, and secure technology, and it is cheap compared to other devices. It consists of a reader and one or more tags that can be active, passive, or semi-passive, based on their computational power and sensing capability [12]. Passive RFID tags do not use a battery, while active ones use their own battery. RFID has various uses such as personal identification, distribution management, tracking, patient monitoring, vehicle management, and so on.

2.1.2 Wireless Sensor Network (WSN)

WSN is also one of the main technologies used in IoT; it can communicate information remotely in different ways. It has smart sensors with microcontrollers that enable it to gather, process, analyze, and distribute measurements such as temperature fluctuations, sound, pressure, and heart beat rates instantly in real time [11].

2.1.3 TCP/IP (IPv4, IPv6)

TCP/IP is a protocol suite that identifies computers on a network. There are two versions of the IP protocol, namely IPv4 and IPv6. IPv4 is currently the most widely used, but most of its address space has been depleted. For an IoT that interconnects anything, IPv4 is not a good choice because of its small address space. The newer version, IPv6, is a good solution for a future where everything is connected, because it has a very large address space that can provide an address to, and uniquely identify, almost anything [11]. Even though it is not yet widely used, it is the future for IoT when thinking about connecting almost anything.
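To give a concrete sense of the scale difference: IPv4's 32-bit addresses allow 2^32 (about 4.3 billion) addresses, while IPv6's 128-bit addresses allow 2^128 (about 3.4 x 10^38). The short Java snippet below, added here purely as an illustration, computes both counts:

    import java.math.BigInteger;

    public class AddressSpace {
        public static void main(String[] args) {
            // IPv4: 32-bit addresses -> 2^32, about 4.3 billion
            BigInteger ipv4 = BigInteger.valueOf(2).pow(32);
            // IPv6: 128-bit addresses -> 2^128, about 3.4 * 10^38
            BigInteger ipv6 = BigInteger.valueOf(2).pow(128);
            System.out.println("IPv4 addresses: " + ipv4); // 4294967296
            System.out.println("IPv6 addresses: " + ipv6);
        }
    }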

2.1.4 Visualization Component

This is also an important component of IoT, because without good visualization, interaction of the user with the environment is not achievable [12]. It should be noted that when designing any kind of visualization for IoT, the way the analyzed data is presented matters for making better decisions. That means products with easy-to-understand, user-friendly interfaces need to be designed, using already existing technologies such as touch screens in smart phones, tablets, and other devices, according to the needs of the end user.

2.2 Application Areas

IoT is the future of technology, where all things are interconnected to exchange data and provide information for the betterment of society. There are a lot of application areas that have already broken into the IoT market, and some that are not widely deployed yet. Examples of common IoT application areas include transportation, which has many domains such as traffic management, parking for vehicles, highway and road construction, and smart vehicles for the public. IoT can also make infrastructure available with a reduction in costs and resources, by providing smart metering for utilities in water and light distribution and smart grid systems. All these applications and many others show that IoT is being applied in all kinds of areas for better services, and it promises to be used even more widely in the future. In the next subsections, selected IoT use cases are discussed briefly.

2.2.1 Smart Home

Smart Home is a technology that enables almost all home appliances used on a daily basis to be connected to the internet or to each other [13]. This helps to provide better services and act according to the preferences of the owner. The home appliances may include Heating, Ventilation, and Air Conditioning (HVAC) systems, microwave ovens, lighting systems, refrigerators, garages, smart TVs, and so on. Examples include controlling the temperature of the house, controlling the lighting systems in the rooms, and checking whether the oven is on or off. These things can be deployed in a smart home environment and can also be monitored by voice control from a smart phone (Siri and HomeKit from Apple, for example) [14].

2.2.2 Wearable

This area of IoT is also getting popular nowadays, as more and more wearable devices are being manufactured. A wearable is a small mobile electronic device that comes with wireless sensor communication capability to process and gather information [15]. Wearable devices can work by themselves or by being connected to a smart phone via Bluetooth. Examples include smart watches, wrist band sensors, and rings, to mention a few. For example, smart watches connected via Bluetooth provide a variety of uses for individuals, such as email notifications and alerts for messages and incoming calls. The other kind of wearable being used widely is the wrist band sensor, which can be applied to interactive exercise and activity tracking (heart beats, pulse rates, etc.) [16]. Examples include the Apple smart watch, the Samsung Gear smart watch, and Google Glass.

2.2.3 Smart City

Smart City is a technology that delivers smart urban services to the general public, maintaining a safer environment and minimizing cost. It aims at using the available resources wisely and effectively to provide better services while reducing operational costs [17]. The different areas where IoT can be deployed in the city include e-governance, traffic management, parking services, street and road lighting, and many more [17]. It can

also be used to reduce the pollution arising from traffic congestion in bigger cities, hence playing a vital role in the sustainability of the city.

2.2.4 IoT in Agriculture - Smart Farming and Animals

This IoT application area is promising, especially in countries whose economies mainly depend on agricultural production. It is a technology where traditional agricultural equipment such as tractors carries smart sensors that measure the temperature and humidity of the soil and the water distribution. It also includes animals on agricultural farms, which are identified using RFIDs [18][19]. It enables animals to be traced and detected in real time when an outbreak of contagious disease occurs. This technology can also be used for preventive maintenance of the equipment well in advance. It revolutionizes how traditional farming is done and takes it to the next level of using the data generated from the embedded smart sensors to obtain better production and make better decisions, such as what seeds to plant, the expected crop yields, and water utilization levels. It also enables farmers to deliver their products directly to consumers [19].

2.2.5 IoT in Health/Connected Health

This is one of the most widely used IoT use cases. It is a technology that enables hospitals and patients to be connected remotely. Connected health technology keeps patients connected 24/7, which enables monitoring their health conditions and sending data to the hospital, which in turn helps doctors flexibly control and monitor their patients' well-being. This can be achieved by using smart phones and wearables, in the form of implantables that work remotely in patients' bodies or palms, so that these devices transfer the generated data to the doctor's end for further processing, notifying of emergency conditions and tracing symptoms of health threats well in advance [5]. This is vital for both hospitals and patients. For the former, the ratio of doctors to patients is not equally distributed, so this technology enables doctors to follow more patients from wherever they are, which was not possible without it. The other benefit is that, since the data is gathered by the devices, errors are less likely than with human data entry, and the data is readily available to the doctors, speeding up decision making. On the patients' side, it is good for emergency cases and it enables preventive care, especially for elderly people [5].

2.3 Challenges and Solutions

As there are a lot of emerging applications and evolving technologies in the IoT field, the challenges have also increased along with these growing trends in applications and technologies. In the following subsections, major challenges and solutions are discussed.

2.3.1 Challenges

There are a lot of challenges that the IoT field is currently facing. Bandwidth and battery problems in small devices, power disruptions related to the devices, and configuration issues are some of them [12][20]. Apart from these, it can be generalized that the major challenges facing IoT are: data security, data control and access, lack of uniform standards/structures, and the large volume of data produced.

1. Data Security: Data security in terms of IoT is defined as the necessity to ensure the availability and continuity of a given application and to avoid potential operational failures and interruptions of internet-connected devices. The threats can come at different levels, such as the device level, the network level, or the system/application level. They also come in a variety of forms, such as arbitrary attacks like Distributed Denial of Service (DDoS) and malicious software [20]. Devices such as sensors, RFID tags, and cameras, or network services (WSN, Bluetooth), can be vulnerable to such attacks and in turn be used as botnets [21]. Home appliances such as refrigerators and TVs can also be used as botnets to attack these and similar devices.

2. Data Control & Access/Privacy: It is known that IoT applications produce large volumes of data at a fast rate from different devices, and that these smart devices collect and process personal information [22]. But knowing what risks these devices carry, how the data is produced and used, who owns and controls it, and who has access to it are privacy questions one needs to ask when using the services of these devices. It is obvious that the data produced by these devices raises privacy concerns among users. The concerns mostly come in two forms [20]: first, personal information about the individual is collected and identified, and the owner does not know who accesses it or to whom it is disclosed; second, the individual's physical location can be traced and his/her whereabouts known, hence violating privacy. This shows that privacy is one of the basic challenges in the IoT field, as it is anywhere in the IT field.

3. No Uniform Standards/Structures: IoT comprises different components, such as hardware devices, sensors, and applications. These different components are manufactured and developed by different industries. When these components are designed to be used in IoT solutions, they need to exchange data. Problems arise when they try to communicate, because the standard used in one product is not used in another, creating communication or data exchange problems which may hinder the expansion of IoT products. The problem is not only in the design of devices, but also in the internet protocols used today. The currently working standard protocols for the internet are not compatible with IoT implementations [20], so

sometimes ad-hoc protocols from different vendors are used, for example in wireless communications. The absence of uniform standards/structures for the different technologies used in IoT is one challenge for the field.

4. Large Volumes of Data Produced: This is another challenge in IoT: the data produced by various sensors and mobile devices is heterogeneous, continuous, very large, and fast. This data needs to be processed instantly, before it expires. Managing these kinds of data is beyond the capacity of traditional databases. As the number of connected devices is expected to increase in the future, the data produced by these devices is going to increase exponentially, and good analytics platforms and storage systems are needed.

2.3.2 Solutions

As the challenges of IoT are large, solutions that address them should be developed and put to work to provide better services that are trusted by all parties, such as users and companies. Some of the solutions include using standard encryption technologies that comply with IoT. Since the devices are mobile, the encryption technologies to be used must be fast and consume little energy, because energy consumption is another problem of IoT devices. Using authentication and authorization schemes for controlling access levels to view the data is another solution that should be considered when designing IoT applications.

Some of the solutions to the problems discussed include:

1. Having Uniform Shared Standards/Structures: This is helpful in that having standard protocols or structures makes vendors follow the same structure, so that no problem is created when there is a need to integrate different parts developed by different organizations. For example, if hardware and sensor device designers, network service providers, and application developers all follow some standard for IoT, it will greatly reduce the integration and compatibility problems that would otherwise arise [20].

2. Making a Strong Privacy Policy for IoT: A strong privacy policy for IoT, covering how individual data is collected and used in a way that is transparent to the user, increases the user's trust in the service and makes him/her aware of how the data is used and how to control it. This means the user should be put at the center of deciding what personal information goes where and how it is used [23].

3. Using Anonymization: Anonymization is a method of modifying personal data so that nothing is known about the individual. It does not only include de-identification by removing certain attributes; the data also has to be unlinkable, because a large volume of data is being produced all the time [24]. Methods such as k-anonymity can be used.

4. Robust Storage Systems: As the data produced by IoT devices is large in volume, fast and powerful storage mechanisms are needed, such as fault-

tolerant NoSQL databases, which can handle very large data, even more than is currently needed.

Chapter 3

Overview of Tools

In this chapter, three tools that are mainly used in the analysis of streaming data are studied. The tools chosen are Apache NiFi, Apache Spark Streaming, and Apache Storm. Their general overview and features are reviewed, which serves as a basis for the study of their similarities and differences in the next chapter.

3.1 Apache NiFi History and Overview

Apache NiFi, originally named "Niagara Files", was first developed by the National Security Agency (NSA) in the United States in 2006, and was used there for 8 years. It was first developed to automate data flow between systems [25]. In November 2014 it was donated to the Apache Software Foundation (ASF) through the NSA's Technology Transfer Program. In July 2015 it became a Top Level Project of the ASF, and six releases of NiFi exist at the time this paper is written (0.6.0). Data flow is an automated and managed flow of data between systems. This means that there is a flow of information from one system to another, where one can be considered a producer and the other a consumer. These flows of information need to be guaranteed, to make sure they reach the intended parties at the time needed. But it is clear nowadays that data flow between systems faces a lot of challenges and problems. It is much more of a challenge today than in earlier times, because back then organizations did not have very large numbers of systems exchanging information; they only had one or two systems that were not too big a problem or too complex to integrate and exchange data between. Currently, however, data flow management systems face many challenges in handling different sets of data.

The major problems of data flow are:

• Integration problem: This is a problem because the different systems existing in organizations have different architectures, and even newly built systems may not consider the architectures of the existing ones. Integrating the different systems existing in the organization is beneficial to both the organization and the users. For the company, integrated systems mean information can easily flow between the different systems, which in turn is good for better decision making.

For the users, they are able to get what they request in a fast and easy way, without knowing exactly where each module or function is found; an integrated system with good data flow provides this to the users efficiently and effectively.

• Priorities of organizations change over time: This means that what was considered of little value at one time may be considered valuable next, and this needs to be taken into account when making decisions. In these kinds of conditions, the data flow system must be robust and fast enough to handle the new changes that occur and adapt to the existing ones without affecting other flows.

• Compliance and Security: This is a problem for data flow management systems because whenever organizational policies or business decisions change, there is a possibility that data security will be mistreated when trying to adhere to the new rules or decisions. Systems must always be kept secure for users, whether or not there is a change in organizational policies or business decisions, which again enhances data flow management.

NiFi supports running environments ranging from a laptop to many enterprise servers, depending on the size and nature of the data flow involved. It also requires large or at least sufficient disk space, as it has several repositories (content, flow file, provenance) whose contents are stored on disk. It can run on any machine with a major operating system (Windows, Linux, Unix, Mac OS), and its web interface renders on the latest major browsers such as Internet Explorer, Firefox, and Google Chrome.

3.1.1 NiFi Architecture

NiFi supports both standalone and cluster mode processing. Their features are discussed below.

Standalone Architecture

NiFi requires Java: the JVM hosts it, and the amount of memory it uses depends on the JVM. It has a web server inside the JVM that displays its components in a user-friendly UI. The flow file, content, and provenance repositories are all stored in local storage.

The different parts of the architecture, as shown in Figure 3.1 from [8], are:

• Flow Controller: the main part of the NiFi architecture; it controls thread allocation for the different components.

• Processor: the main building block of NiFi; it is controlled by the flow controller.

• Extensions: operate within the JVM and hold the different extension points in NiFi.

• Flow File Repository: where NiFi keeps track of the state of the active flow files. It uses write-ahead logging and lives on a specified disk partition.

• Content Repository: holds the actual content of a given flow file; the contents are stored in the file system.

• Provenance Repository: holds information about the data: what happened to it, and how and where it moved over some period of time, beginning from its origin. All this information is indexed, which makes searching easy.

Figure 3.1: NiFi standalone Architecture - source [8]

Cluster Architecture

NiFi can also be used in a cluster, where the NiFi Cluster Manager (NCM) is the master and the other NiFi instances connected to it are the Nodes (slaves). In this model, it is the Nodes that do the actual processing of the data, while the NCM manages and monitors the changes.

A NiFi cluster uses a site-to-site protocol which enables it to communicate with other NiFi instances, other clusters, or other systems such as Apache Spark.

Figure 3.2: NiFi Cluster Architecture - source [8]

Figure 3.2 shows that the Nodes communicate only with the NCM and not with each other. The communication between the Nodes and the NCM can be by unicast or multicast. When one Node fails, the other Nodes do not automatically pick up its load; rather, the NCM calculates the load balance and distributes the load to another Node. The other functions of the NCM are: communicating data flow changes to all the Nodes, and receiving health information (whether they are working properly) and status information from the Nodes. The Nodes are regularly checked for load balancing by the master, so that they are given flow files to process according to their load. As many Node instances as needed can be added horizontally to the cluster, as long as the NCM is working and operating.

3.1.2 Key Features

Apache NiFi has a lot of useful features that provide better flow management mechanisms compared with other systems. It can be said that it was designed by learning from the drawbacks other systems have. These features can also be considered advantages it has over other systems.

The points below are some of the main features of NiFi [8][26].

• Flow-specific Quality of Service (QoS): This comprises guaranteed delivery vs. loss tolerance, and latency vs. throughput. The QoS achieved for a flow relates to how the flow is configured: to give high throughput with low latency, or to be loss tolerant. NiFi can be configured per flow to be loss tolerant or, where data loss is unacceptable, to guarantee delivery. Guaranteed delivery is achieved by using both the content repository and persistent write-ahead logging (WAL): NiFi keeps track of changes made to a flow file's attributes and to the connection the flow file belongs to [27], writes these changes to the log before they are written to the actual disk, and finally writes the contents to the disk. This is important for recovery and prevents data loss. Latency is the time required to process a flow file from beginning to end, and throughput is the amount of flow file content processed in a given time, i.e., how many flow files are micro-batched in a specified time. Every time a processor finishes processing a specific flow file, the repository must be updated before the flow file is sent to the next component, which is expensive and takes time. Since this process is expensive, it is better to do more work at once, i.e., to micro-batch more flow files for processing in a given time. The drawback is that the next component or processor cannot start until the repository is updated and these flow files are processed, hence producing latency. NiFi lets the user trade latency against throughput when configuring a processor in its settings tab, so that a suitable point can be chosen to get the best result according to the need.

• Friendly User Interface (command and control): NiFi provides a friendly User Interface (UI) running in a browser, designed using HTML5, drag-and-drop mechanisms, and JavaScript technologies. The UI is useful especially when flows become complex and managing them from a console would be very tough. NiFi provides an easy command-and-control mechanism that enables making changes to a specific flow file or processor while controlling only the affected parts; the effect is seen in real time, and other flow files or processors are not affected at all.

• Security: One concerning issue in other flow management systems is security. NiFi provides security in two forms: system-to-system and user-to-system security mechanisms. For the first, it enables encryption and decryption of each of the flows involved, and when communicating with other NiFi instances or other systems it can use encryption protocols like 2-way SSL. For the second, i.e., user to system, it provides 2-way SSL authentication and also controls users' access levels through privilege levels such as Read Only, Data Flow Manager (DFM), Provenance, and Admin.

• Dynamic Prioritization: NiFi has a queuing mechanism that enables it to retrieve and process flow files according to specified queue prioritization schemes. Prioritization can be based on size or time, and NiFi even allows making custom prioritization

schemes. The need to prioritize queues arises because of constraints in bandwidth or other resources, or because of how critical an event is. This is helpful for setting the priorities according to the required properties or needs at hand, because the priorities set at one time may not be good enough at other times, and they will affect the decision if not set properly; hence NiFi allows dynamic priority setting for different scenarios according to the need.

• Data Provenance: Data provenance is one of the most important features of NiFi; it enables managing and controlling the flow of data from beginning to end by automatically recording each performed action. From the data provenance page, the user/DFM can see what happened to the data: where it came from, where it went, what was done with it, and so on. This is useful when problems occur, because it increases traceability and helps track down the issue. It also enables seeing the lineage or flow hierarchy of the data.

• Extensibility: Another feature NiFi provides is the extensibility of its various components, such as Processors, Reporting Tasks, Controller Services, and Prioritizers [8]. This is useful because it enables users or organizations to design their own extension points/components and embed them in NiFi to gain better service in their own specializations. The most widely extended component is the processor: many organizations design their own processors to ingest data into, or egress data from, NiFi. For example, in IoT applications data is produced by different devices in different formats, and these different data formats need to be processed to gain insight from them; NiFi's extensibility can be used to design processors that ingest these formats into NiFi, where its built-in processors then process the ingested data according to the need. This makes extensibility one of the key features of NiFi.

3.1.3 NiFi UI components

NiFi provides visual command and control for creating, managing, and monitoring data flows. After the user starts the application, entering the URL https://<hostname>:8080/nifi in a web browser brings up a blank NiFi canvas the first time. The <hostname> is the name of the server or the address that the NiFi instance is running on, and 8080 is the default port number for NiFi. The points below show the different components of the UI, as in Figure 3.3.

• Default URL address: As shown in Figure 3.3, since the machine is running locally, the hostname is "localhost", with the default port number 8080, which can be changed in the "nifi.properties" file in the NiFi directory (see the snippet after this list).

• System Toolbar: NiFi has 4 system toolbars, namely the Component, Action, Search, and Management toolbars, as shown in Figure 3.3.

– Component: consists of the different components such as Processors, Input and Output Ports, Process Groups, Remote Process Groups, Funnel, Templates, and Label.

Figure 3.3: NiFi UI canvas

– Action: consists of buttons to perform actions on a particular component. Some of the actions are Enable, Disable, and Start if the process is not started or is stopped; Stop if the process is started; Copy to copy the particular component; Group to group different components together; and so on.

– Search: consists of the search field to search components existing on the canvas.

– Management: consists of buttons used by different users (DFMs, Admin) according to their privilege levels. It includes bulletin boards, the Summary page, Provenance, and so on.

• Status Bar: In Figure 3.3 above, the status bar includes the Status and Component Info areas labeled in the figure. The Status shows the active threads, if threads are being used; it also shows the total number of queued flow files between the different components, the existing clusters and how many nodes are connected, and a timestamp of the last refresh. The Component Info shows how many processors or other components are running, stopped, invalid, disabled, and so on.

• Navigation Pane and Bird's Eye View: The navigation pane enables navigating, zooming in, and zooming out of the components on the canvas, and the Bird's Eye View allows the user to view the data flow easily and quickly.
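For illustration, the web server entries in the "nifi.properties" file referred to above look roughly like this (a sketch of typical NiFi 0.x defaults; exact entries can vary between versions):

    # conf/nifi.properties - web server host and port
    nifi.web.http.host=
    nifi.web.http.port=8080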

3.1.4 NiFi Elements

NiFi has different elements, some of which are discussed further in the subsections that follow; Figure 3.4 shows the main components it supports.

1. User Management: NiFi provides a mechanism for user management and for controlling privileged access. It supports user authentication either by client certificates or by a username/password mechanism. Authenticated users use HTTPS for accessing data flows in a browser. In order to use the username/password mechanism,

17 Figure 3.4: NiFi main components

a login identity provider must be configured in the "nifi.properties" file: one property points to the provider configuration file, and the other indicates which provider should be used, i.e.:

• nifi.login.identity.provider.configuration.file

• nifi.security.user.login.identity.provider

Likewise, for controlling access levels, NiFi provides a pluggable authorization mechanism that enables users to have access to the system and to be assigned different roles. For this, the "nifi.properties" file is configured with these two properties:

• nifi.authority.provider.configuration.file - specifies the configuration file for authorization providers

• nifi.security.user.authority.provider - which provider to use from the configured ones
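As an illustration, the four properties above might be set as follows (a sketch using typical NiFi 0.x defaults; the provider identifiers are examples referring to entries in the provider XML files, not prescribed values):

    # conf/nifi.properties - authentication and authorization providers
    nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
    nifi.security.user.login.identity.provider=ldap-provider
    nifi.authority.provider.configuration.file=./conf/authority-providers.xml
    nifi.security.user.authority.provider=file-provider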

NiFi also provides Roles for controlling authorization; some of the Roles it provides are listed below. Users can have different Roles assigned to them.

• Administrator: configures user accounts and the size of thread pools.

• Data Flow Manager (DFM): manipulates the data flow, e.g., designing, ingesting, routing.

• Read Only: may only view the data flow; not allowed to make changes.

• Provenance: able to query the provenance repository and view lineage, and to view and download the content of flow files; not able to replay flow files in case of failure or during troubleshooting.

2. Processor: The Processor is the main building block of a NiFi data flow. It is responsible for ingesting data from other systems or NiFi instances, routing, transforming, and

finally outputting the data to other systems. It is also the main extension point: custom processors can be designed to enable organizations to input/output their flow files using NiFi. Figure 3.5 below describes its anatomy.

Figure 3.5: NiFi Processor Anatomy

• Processor Type and Name: As the name implies, the Processor Type specifies the type of processor used; in this example it is a "PutFile" processor, which is responsible for writing flow files to disk. The name of the processor is shown in bold; by default it takes its type name as its name, but it can be renamed in the settings tab of the processor's configuration page. In this example the name is "Save Matched Tweets", a PutFile-type processor that stores matched tweets to disk.

• Status Indicator: This is the icon at the top left corner of the processor that shows the current status of the processor. There are different status indicators, based on the validity of the processor:

– Running: shows that the processor is running. It has a green play icon.
– Stopped: shows that the processor is currently stopped. It has a red icon.
– Invalid: shows that the processor cannot be started because there are missing properties that need to be set. The missing properties can be seen by hovering over the icon. Its icon is a triangle with an exclamation mark inside it.
– Disabled: shows that the processor is disabled and cannot be started until it is enabled.

• Flow Statistics: shows the statistics of the data flow over the past 5 minutes, in the fields In, Read/Write, Out, and Tasks/Time. These show, respectively: the number of flow files and total size ingested into the processor; the total size of flow file content read from and written to disk; the number of flow files and total size of flow file content transferred to the next processor/component; and the number of tasks this processor performed and the time it took to perform them, over the past 5 minutes.

3. Input/Output Ports: The Input Port is one of the components of NiFi; it is used for transferring data coming from other components or systems into a Process Group. The Output Port is used for transferring data from a Process Group to destinations outside of the Process Group, or to other components/systems such as Apache Spark.

4. Process Group and Remote Process Group: A Process Group is another NiFi component that logically groups a set of components, which makes maintenance easier. It prompts the user for a unique name and provides a level of abstraction. A Remote Process Group (RPG), on the other hand, follows the same idea as a Process Group, but connects to another NiFi instance remotely. It asks for the URL of the remote instance rather than a unique name, so that a connection is created between the RPG and that NiFi instance. It uses the site-to-site communication protocol to communicate with remote instances or other systems.

5. Template: A Template is another component of NiFi that enables re-use of the components created inside it. It lets users create Templates and export them in XML format; a Template can then be imported into other NiFi instances for use. It is thus the feature that makes NiFi data flows reusable.

6. Funnel: A Funnel is a component used for combining different components or processors into one, which makes prioritizing easier. If a data flow has many processors, setting priorities at each processor hurts performance; NiFi instead provides the possibility to set priorities, and to change them dynamically, at a single point, i.e., in the Funnel.

7. Provenance and Lineage: Data provenance is one of the key features as well as elements of NiFi; it keeps very detailed records on each piece of data it ingests. Its provenance repository stores everything that happens to the data from beginning to end, such as ingesting, routing, transforming, cloning, etc. This means that everything that passes through NiFi is recorded and indexed, which makes it easier to search, to track down problems that occur and provide solutions, and also to monitor the overall data for compliance. There is a provenance icon in the Management toolbar at the top right corner of the NiFi UI, and it displays everything that has happened in the data flow. It enables searching and filtering by Component Name, UUID, and Component Type. When the "View Details" icon is clicked, the details of that particular event are displayed in 3 tabs, as in Figure 3.6: the Details tab lists the time, type of event, UUID, and so on; the Attributes tab lists all the attributes that existed at the time the event occurred, with their previous values; and the Content tab enables downloading or viewing the content. NiFi also provides the possibility to see the provenance data for each processor by right-clicking on the processor and choosing Data Provenance. As an example, a Twitter data analysis flow is used that searches for all tweets containing the phrase "InternetofThings" and loads the language, location, text, username, and so on, according to the properties set. Figure 3.6 below shows the provenance data for it.

Figure 3.6: NiFi Provenance

On the right side of the provenance page there is an icon for showing lineage, "Show Lineage", which shows a detailed graphical representation of what happened to the data. It enables seeing the details and parents of a particular event, and expanding it as needed. It has a slider that enables seeing which event was created at what time, and how long it took to create, by dragging the slider. It also enables downloading the lineage graph, as shown in Figure 3.7.

Figure 3.7: NiFi Lineage

3.2 Apache Spark Streaming

Apache Spark is an open source, fast and general engine for large-scale data processing [28][29]. It was originally developed at the AMPLab at UC Berkeley, California [29], and is currently a top-level Apache project. Spark's core abstraction is called the Resilient Distributed Dataset (RDD), an immutable collection of elements. Apache Spark is a main API with different components: Spark SQL, MLlib, GraphX, and Spark Streaming.

• Spark SQL: one of the modules in the Spark core API; it enables the user to work with traditional structured data [30].

• GraphX: another Spark API, for graphs and graph-related operations [31].

• MLlib: a Spark API for machine learning, consisting of various kinds of machine learning algorithms [32].

• Spark Streaming: a Spark API mainly dealing with computation and analysis of live streams of data flowing in at specified time intervals [33].

Apache Spark Streaming is one of the components of the Spark core API; it processes streams of data as micro-batches. It is also possible to use other components from the Spark API, such as MLlib and Spark SQL, together with it for further processing.

3.2.1 Key Features

As one of the components in the Spark API, Spark Streaming shares the main features that Spark provides and adds others on top of them. Some of the main features are listed below.

• Spark Streaming provides a high-level abstraction called Discretized Streams (DStreams), which are built on Resilient Distributed Datasets (RDDs), Spark's main abstraction.

• It makes integration of streaming data with batch processing easy, because it is part of the Spark API.

• It receives data from different sources such as HDFS, Flume, and Kafka; it also enables custom-made receivers.

• It supports different programming languages such as Java, Scala, and Python.

• Fault tolerance: it has "exactly-once" semantics, which make sure that data is not lost and arrives exactly once, avoiding duplicates; this is also advantageous for data consistency.

• It provides stateful transformations that maintain state even if one of the nodes fails, which is good for fault tolerance.

• Speed: it performs in-memory computations, which have low latency and provide faster processing than computations performed on disk.

3.2.2 Basic Concepts and Main Operations

Basic Concepts

The main programming model of Spark Streaming is its abstraction, Discretized Streams (DStreams). A DStream is a continuous stream of data, internally represented by Resilient Distributed Datasets (RDDs). A DStream can be created by ingesting data streams from different sources such as Kafka, Flume, or Twitter, or by applying transformations to other DStreams. The RDD is Spark's main abstraction: a fault-tolerant collection of elements that can be processed in parallel [34].

Figure 3.8: Continuous RDDs form a DStream - source [33]

Figure 3.8 shows a DStream as a continuous stream of batches of RDDs at a specified time interval; when all these batches of RDDs are combined, they form a DStream. DStreams support various transformations, similar to those on RDDs in the Spark API. These transformations allow the data from input DStreams to be modified. Examples of such transformation functions include map, filter, and reduce.
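To make these concepts concrete, the following minimal word-count sketch uses the Spark 1.6 Java API (the version studied in this thesis). It builds a DStream from a socket source and applies the flatMap, mapToPair, and reduce-style transformations mentioned above; the host, port, and batch interval are arbitrary placeholder choices:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
            // Every 10-second batch interval becomes one RDD in the DStream.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
            });
            JavaPairDStream<String, Integer> counts = words
                .mapToPair(new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer a, Integer b) { return a + b; }
                });

            counts.print();  // an output operation is required to trigger execution
            jssc.start();
            jssc.awaitTermination();
        }
    }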

Main Operations

Spark Streaming also provides various kinds of operations on DStreams. The main ones are the transform, window, join, and output operations [33].

• Transform and Join: The transform operation allows RDD-to-RDD operations to be applied to a DStream, such as joining the data stream with other datasets. Spark Streaming enables different DStreams to be joined with other DStreams: there are stream-stream joins, which join streams of RDDs with streams of other RDDs, and stream-dataset joins, which join streams with datasets via the transform operation [33].

• Window Operations: Since the live streams of data coming from various sources are continuous, they cannot be computed as a batch of files, and traditional operations cannot be performed on them. Spark Streaming provides a solution with window operations, which enable these streams of data to be processed, transformed, and computed within a specified time range over a sliding window. Every window operation must specify a window length and a sliding interval to perform its actions over a window [34]. The window length is the duration of the total window, while the sliding interval is the rate or interval at which the operation is performed. Spark Streaming supports many window operations, such as window, countByWindow, and reduceByKeyAndWindow (a sketch follows this list).

• Output Operations: Spark Streaming supports many output operations, which make sure the processed streams are stored in external storage such as HDFS, file systems, or databases, or even displayed on live dashboards. print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD are some of the output operations Spark Streaming provides (see the sketch after this list).
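Continuing the word-count sketch above (again an illustration, with arbitrary durations and paths), the snippet below applies reduceByKeyAndWindow over a 30-second window sliding every 10 seconds, followed by two output operations. Both the window length and the sliding interval must be multiples of the batch interval:

    // A 30-second window sliding every 10 seconds over the `counts` pair DStream.
    JavaPairDStream<String, Integer> windowed = counts.reduceByKeyAndWindow(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        },
        Durations.seconds(30),    // window length
        Durations.seconds(10));   // sliding interval

    // Output operations: print to the console and persist each batch as text files.
    windowed.print();
    windowed.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt"); // path prefix and suffix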

3.2.3 Architecture

How Spark Streaming operates can be summarized as:

• Receiving the input - the sources could be Kafka, Twitter, log data, etc., which Spark Streaming divides into small batches

• Spark Engine - processes the data received from Spark's memory

• Output - batches of processed data are sent to storage systems

The tasks are assigned dynamically to the nodes based on the available resources, which enables fast recovery from failures and better load balancing between the nodes. Its ability to divide the input streams into small batches enables it to process the data in batches and reduces the latency compared to processing the records one by one.

Figure 3.9: Spark Cluster - source [33]

In addition to this, Spark Streaming runs on a cluster as in Figure 3.9. The main program in a Spark cluster (also known as the Driver program) has a Spark Context that coordinates the Spark application running on the cluster. The first step is creating a connection to an available Cluster Manager, which allocates resources to individual applications. Once the connection is created, Spark acquires Executors on Worker Nodes, which in turn run application code in the cluster. Spark then sends the code to the Executors, which are able to run tasks and keep data in memory or on disk storage. Finally, the Spark Context sends the tasks to be run. The available cluster managers include Hadoop YARN and Apache Mesos, and Spark can also run in Standalone Mode.
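As a small illustration, the cluster manager is chosen through the driver's configuration; the master URLs below are standard examples, and the standalone host name is an assumed placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode uses spark://host:7077; on YARN the master is
// "yarn-cluster" or "yarn-client" (Spark 1.x); local[n] runs in one JVM
val conf = new SparkConf()
  .setAppName("ClusterExample")
  .setMaster("spark://master-host:7077") // assumed standalone master URL
val sc = new SparkContext(conf)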

3.3 Apache Storm

The other tool that is going to be studied in this chapter is Apache Storm. An overview of the tool, its features, its components, and its main use cases will be briefly given.

3.3.1 Overview

Apache Storm is a distributed, resilient, real-time computation system [35]. It was developed by Nathan Marz and became open source in September 2011 [36]. It works in a way similar to Hadoop, except that Apache Storm is for real-time streaming data while Hadoop is for batch processing.

3.3.2 Basic Concepts and Architecture

In this subsection, the different components and concepts of Storm are discussed and its architecture is presented.

Basic Concepts

• Tuple :- the primary data structure in Storm, which is a list of values that supports any data type [37].

• Streams :- a core abstraction in Storm by which unbounded tuples form a sequence or stream. A stream can be formed by the transformation of one stream into another. It has primitive types such as long, string, and byte arrays, and also supports custom types defined by users, provided they implement their own serializers.

• Spouts :- the main entry point of streams into Storm. Different external sources such as Kafka and the Twitter API ingest their data through Spouts. Spouts can be Reliable, where replaying a lost tuple is possible if failures occur, or Unreliable, where replaying is not possible and the data will be lost.

• Bolts :- where the main processing takes place. A Bolt takes inputs from Spouts and processes them, and finally the processed tuples are emitted to downstream Bolts or to storage such as databases. The processing includes stream transformation, running functions, aggregating, filtering, joining data, or sending it to databases.

• Topology :- the main abstraction of Storm. It is a network of Spouts and Bolts which are connected with stream groupings. Each node of the graph/network represents either a Spout or a Bolt, and the edges represent which Bolts are subscribed to which component, i.e. Spout or Bolt. In Figure 3.10, the nodes are spouts (S1, S2) and bolts (B1, B2, B3, B4). B1, B2, and B4 are subscribed to streams coming from S1; B4 additionally is subscribed to streams coming from S2. This shows that in a Topology, the tuples are streamed only to the components that are subscribed to them.

• Trident :- an API that is part of Storm and built on top of it. It supports “exactly-once” semantics.

Figure 3.10: Storm Topology - source [38]

• Stream Grouping :- Storm has different inbuilt stream groupings and also supports custom-made stream groupings. The main stream groupings include Shuffle Grouping and Field Grouping. Shuffle Grouping randomly distributes the tuples among the tasks of the Bolt, while Field Grouping groups the tuples having the same field name [38].

• Task :- refers to a thread of execution.

• Worker :- executes a subset of all the Tasks existing in the Topology.

3.3.3 Architecture

Storm supports both local and remote modes of operation, where the local mode is mainly useful for developing and testing topologies, and in remote mode topologies are submitted for execution in a cluster [38]. There are two kinds of nodes in a Storm cluster, i.e. the Master Node and the Worker Nodes. The Storm architecture has three main components, namely Nimbus, a daemon that runs on the Master Node; the Supervisor, a daemon running on each Worker Node; and Zookeeper, which mainly handles communication between Nimbus and the Supervisors, as shown in Figure 3.11. Their functionality is summarized in Table 3.1:

Figure 3.11: Storm Cluster - source [38]

Nimbus | Supervisor | Zookeeper
Assigns tasks to worker nodes | Receives the work assigned to its worker | Handles the communication between Nimbus and the Supervisors
Monitors for failures | Starts and stops workers as required | Keeps the state of the topology
Distributes code among cluster components | |

Table 3.1: Storm Architecture Components Functionality

3.3.4 Features

The features of Storm also show its advantages and why it is popular nowadays for stream data processing. Some of the main features include:

• Reliability :- It provides guaranteed message processing by using “exactly-once” semantics from the Trident API or “at-least-once” semantics from core Storm. It also makes sure that specific messages will be replayed in case failures occur for those specific messages.

• Fast and Scalable :- Supports the parallel, horizontal addition of machines and scales fast with an increasing number of machines.

• Fault-Tolerant :- Failure in Storm occurs, for example, when a worker dies or when the node itself dies. In the first case, the supervisor handles the failure by automatically restarting the worker, while in the second case the tasks will time out and be assigned to another machine or node.

• Support for many Languages :- Storm uses a Thrift API that makes it possible to support many programming languages such as Scala, Java, and Python.

Chapter 4

Review and Comparison of the Tools

In this chapter, the tools that were studied in the previous chapter are further reviewed and then compared based on some selected parameters. The parameters are not selected based on any particular model, but rather from the characteristics of the tools. It is important to answer questions like:

• Which tool is preferable if one parameter is wanted more than another?

• What would be the complexity of using a given tool for a given case?

• How does each tool respond to the parameters specified?

The selected parameters include:

(i) Ease of use (ii) Security (iii) Reliability (iv) Queued data/data buffering (v) Extensibility

4.1 Review

4.1.1 Apache NiFi

• Ease of Use : NiFi's ease of use comes with its friendly drag-and-drop User Interface, from which the activity and the flows are controlled. If we have more complex data flows with different types, handling them from the command line is very complex and would not provide good detail. NiFi solves this issue by allowing all the flows to be designed in a UI, which reduces complexity and allows fast recovery from problems, making maintenance easy. Another feature that makes NiFi easy to use is that its flows can be changed and customized on the fly without affecting other parts of the flow. It also accepts data from a variety of sources in different formats such as FTP, HTTP, XML, JSON, CSV, and different file systems, which further contributes to its ease of use.

• Security : NiFi has inbuilt security and supports different security schemes both at the user and system level. It allows each data flow to be encrypted/decrypted through processors provided for this purpose. It provides both certificate and username/password authentication mechanisms. It does this through 2-way SSL authentication, where a specific user is allowed access if the certificate it uses is legitimate, by exchanging acknowledgments between the client/browser and the server. It also has an access-level authorization scheme where users are assigned different Roles. This is important for use cases where security is critical, such as the financial, governmental, and similar sectors.

• Reliability : The reliability of a system is its ability to function properly for its intended purpose without failure. It includes the ability to provide guaranteed delivery of the processes at hand. NiFi is a reliable system and provides this feature by using the Content Repository and the Write-Ahead Log (WAL) mechanism, where the content of the data is stored first in the log files before it is written to disk. Hence, if a problem occurs, it is possible to recover the data from the log files without affecting the flow.

• Queued Data/Buffering : A data buffer is a memory area where data is temporarily stored. Queuing of data occurs because the data is not processed at a given time, or because a node failed. This queued data has to be put in some memory as a data buffer. But it takes memory space if the queued data is always kept, so there has to be an efficient way to handle such cases without exhausting resources. In this regard, NiFi buffers queued data efficiently, keeping the queued data in memory. It has a back pressure mechanism where a certain limit for processing data is specified; if that limit is reached, more data will not be processed until the queued data is processed and memory space is released. By providing these features, NiFi handles queued data in an efficient way.

• Extensibility : NiFi's extensibility serves various uses. It has many extension points that users can design against according to their needs, such as processors, reporting tasks, and controller services, to mention a few. Flow files can be changed in real-time without affecting other flow files. There is no need to recompile the whole flow: if a new flow file is created or an old one is removed, its effect is seen in the UI in real-time without compilation.

4.1.2 Spark Streaming

• Ease of Use : Spark Streaming's ease of use comes from its core Spark API, which has APIs for different programming languages. It has support for Scala, Java, and Python. This is useful for users who are familiar with the languages mentioned, and shows that it is flexible and addresses more users as the number of supported languages increases. It also has an interactive shell and supports different APIs.

• Security : Spark supports authentication through Kerberos security and using a Shared Secret [39]. Using Kerberos authentication requires creating a Principal and a keytab file and configuring the Spark history server to use Kerberos. Only Spark

running on a YARN cluster supports Kerberos authentication; it is not available in standalone mode. The second type of authentication uses a Shared Secret, where a handshake between Spark and the other system is made to allow communication between them. In order to communicate, both must have the same shared secret key. For this authentication to work, the “spark.authenticate” parameter must be set to true.
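As a minimal sketch, enabling shared-secret authentication in application code could look as follows; the secret value here is a hypothetical placeholder.

import org.apache.spark.SparkConf

// Both communicating sides must be configured with the same secret
// for the handshake to succeed
val conf = new SparkConf()
  .setAppName("SecureApp")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "my-shared-secret") // placeholder value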

• Reliability : Spark Streaming is a reliable, fault-tolerant computation framework where data processing is guaranteed. It uses different mechanisms to address fault tolerance and guaranteed delivery of data, such as exactly-once delivery semantics and Write-Ahead Logging (WAL). Exactly-once semantics is a form of delivery semantics where data is processed exactly one time; it does not allow duplicates to be formed. Failure may occur in two forms: node/executor failure and driver/main program failure. When a node fails, it is automatically restarted and normal operation continues, because the data blocks in the receivers are replicated. Once the data is ingested to a node, it is guaranteed that it will be processed. When the driver/main program dies, all nodes fail along with their received blocks. If DStream checkpointing is enabled, it is possible to restart the main program from the last checkpoint, after which all executors are restarted. A DStream checkpoint specifies a fault-tolerant directory, such as one in HDFS, to regularly store the state. Failure may also occur while input data is being loaded. When this happens, Spark Streaming recovers some of the data but not all of it. The solution Spark provides to recover all the data is the WAL, where the ingested data is written synchronously to fault-tolerant storage such as HDFS or S3 before being processed. If the data is received correctly, an acknowledgment is sent and the data is processed. If no acknowledgment is sent, a failure has occurred, so Spark reads the log files and the data is sent again for processing from there. All these methods make Spark a reliable and fault-tolerant processing framework.
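A hedged sketch of driver recovery via checkpointing, assuming an HDFS checkpoint directory; the path and batch interval are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/streaming-app" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir) // state is saved here regularly
  // ... define sources and transformations here ...
  ssc
}

// A fresh start builds a new context; after a driver failure the context
// is rebuilt from the checkpoint and processing resumes from there
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()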

• Queued Data/Buffering : In real-time data processing, a queue is created whenever the data is not processed within a specified time interval and the processing rate is slower than the rate at which the data is received. This data is queued in a buffer and keeps increasing if it is not processed or removed. In Spark Streaming too, the data will be queued as DStreams in memory and the queue will keep growing. To overcome this, Spark Streaming provides configuration parameters which help to limit the rate at which data is received and processed. It also uses other methods, such as reducing batch processing times or choosing the right batch size, so that batches can be processed at the rate they are received.
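For example, the receiving rate can be capped through configuration; the maxRate key exists in Spark 1.x, and the backpressure setting, available in later 1.x releases, adapts the rate automatically. The number is illustrative.

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "1000")      // records/sec per receiver
  .set("spark.streaming.backpressure.enabled", "true")  // adapt rate to processing speed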

• Extensibility : When new application code needs to replace old application code, Spark Streaming provides two ways of upgrading. One is shutting down the existing application gracefully and starting the new application, which begins processing from the point where the earlier application left off. The other is starting the new application in parallel with the existing one and shutting the old one down later.

4.1.3 Apache Storm

• Ease of Use : Storm's ease of use comes with its easy-to-use API and its ability to support different programming languages through its Thrift API. Thrift is an Interface Definition Language and communication framework that allows defining new data types that support different programming languages [40].

• Security : Storm, for this particular release (0.9.6), does not provide inbuilt security (authentication and authorization). It does not provide encryption of the data over the network by itself. This means that security mainly depends on technologies outside of Storm, such as firewall settings and encryption applied to different parts such as Topologies. The latest release of Storm supports Kerberos authentication by creating keytabs and principals for the daemons.

• Reliability : Storm guarantees full data processing even if any of the connected nodes in the cluster dies or messages are lost. Full data processing means that all the messages in a tuple tree are fully processed within a specified time interval; otherwise processing fails. Storm guarantees this by providing at-least-once semantics, which guarantees messages are replayed when failure occurs and allows duplicates to be formed. It also offers the Trident API for occasions where exactly-once processing of data is needed. There are different points of failure, such as node failure, worker failure, or daemon failure (Nimbus and Supervisor), and they are handled differently to provide a fault-tolerant system. When a worker dies, it is automatically restarted by the Supervisor and nothing is lost. If a node dies, Nimbus will assign the task to other machines because the tasks assigned to that machine time out. If the daemons die, worker processes are not affected and will continue when the daemons are restarted.

• Queued Data/Buffering : Storm provides techniques to prevent over-queuing, i.e. when data is queued excessively or stays in the buffer too long without being processed. If the incoming data is not processed within the specified time, the buffer begins to fill up with messages and grows too much. This causes the task-processing timeout to be reached, which in turn causes messages to be re-emitted at the spout. So Storm provides back pressure, by which a threshold on the number of messages to process is predefined along with other properties. When this threshold is reached, Storm holds back further messages until those in the queue are processed first.
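A minimal sketch of such a cap in topology configuration, using Storm's 0.9.x packages; the numbers are illustrative only.

import backtype.storm.Config

val conf = new Config
// Cap the number of tuples that may be pending (emitted but not yet
// acked) per spout task; further tuples are held back until acks arrive
conf.setMaxSpoutPending(1000)
// Tuples not fully processed within this many seconds are failed and replayed
conf.setMessageTimeoutSecs(30)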

• Extensibility : New code that is written after the last deployment needs to be recompiled in order to incorporate and use it.

4.2 Differences and Similarities

This section summarizes the selected parameters and how each tool behaves with respect to them. The intention is to show differences, similarities, and their applicability in certain use cases, as shown in Table 4.1.

Parameter | NiFi | Spark Streaming | Storm
Web UI | Has friendly UI | Interactive shell, UI for monitoring | UI for monitoring
Main element | Processors | DStreams | Spouts, Bolts, Topologies
Language | Has its own Expression Language | Java, Scala, Python | Any programming language
Reliability/Fault-tolerance | Content Repository and Write-Ahead Log | Exactly-once semantics, immutable RDDs | At-least-once; exactly-once with Trident
Security | Inbuilt security, SSL, SSH, HTTPS, content encryption | Shared secret, Kerberos authentication | No authentication (0.9.6); Kerberos support in later releases
Applicability/Use case | Good for simple processing | Good for simple + complex processing | Good for simple + complex processing

Table 4.1: Differences and similarities of the tools

Simple processing :- includes extracting, splitting, merging, filtering, routing, ETL transformations, etc., of the data.
Complex processing :- includes massive computations, aggregations, window and join operations, machine learning computations, etc.

4.3 Discussion of the parameters

In this section, a discussion of the parameters for each tool is given. The discussion is intended to guide a reader in deciding which tool to use, considering the listed parameters or a combination of them.

• Ease of use/Usability : Ease of use is one of the main features to consider in every software system. It defines how easy it is to use a system effectively and efficiently. It can come in terms of the programming languages the tool supports, for example by being accessible to a large user base in different languages. It can also be in terms of how the tool manages complexity, i.e. whether it has easy techniques such as a UI or relies on scripts and logs from the command line. That said, Spark Streaming supports different programming languages such as Scala, Java, and Python through its language-integrated APIs. It

addresses a large audience of users familiar with these programming languages. Apache Storm, on the other hand, supports any programming language through its use of the Thrift API. This is important because anyone familiar with a certain programming language can do the work without needing to learn a specific language, with only minor configuration and technical changes. When we consider Apache NiFi, flows can be designed with little or no coding, and it has its own Expression Language, which enables the use of different functions and regular expressions in different formats. It is easy, but it is a new kind of language, so if the learning curve is considered, since it differs from common programming languages, it takes time to learn and utilize it well. Generally, when we consider ease of use in terms of the languages used, even though Storm and Spark Streaming support many languages and reach a large number of users, NiFi's ease-of-use features are important to consider when deciding which tool to use. When considering ease of use in terms of managing complexity, Spark Streaming mainly uses an interactive shell for processing the data and a UI for monitoring the cluster environment, memory usage, information about executors, and so on. But it does not enable controlling and monitoring the flow from the UI. Storm also uses its UI for monitoring, not for the flow of the data but for other information such as memory usage; it basically relies on running scripts or code applications. NiFi, on the other hand, has a friendly UI that goes beyond monitoring and controlling of information and the cluster environment. It enables command and control, where the user can view the flow of files in real-time. It is possible to design the flow and see the effect right away without affecting other flow files. This feature is important when there are many flow files arranged in a complex way, which cannot be handled in an interactive shell or command-line interface. Such a capability, i.e. a user-friendly UI, makes control of the flow easy and enables seeing problems and making the right decisions in real-time, whereas using the command line with many scripts will not speed up decision making when problems occur that need to be handled right away. Considering this feature, NiFi provides easy and effective data processing while controlling the flow with a friendly UI.

• Reliability : Reliability is one of the main features to consider in real-time data processing. All three tools handle reliability in various ways. Spark Streaming uses DStreams, which are continuous RDDs, and RDDs are immutable, which makes them fault tolerant. Spark Streaming provides reliability through exactly-once delivery semantics, which guarantees that the data is delivered exactly one time with no duplication, and it also uses Write-Ahead Logging. Storm provides at-least-once delivery semantics, which guarantees the data is delivered at least one time, so duplication can exist; it provides exactly-once semantics if the Trident API is used. NiFi provides reliability by using its Content Repository and Write-Ahead Logging, where every action is written to log files before it is written to disk. So the choice of tool is a design decision based on how the data should be handled: if duplicate data is acceptable in simple processing, then Storm with its at-least-once guarantee is a good choice; if the situation is a banking transaction where cash withdrawals and the like occur, then exactly-once delivery, as in Spark or Storm with Trident, is the one to choose.

• Security : Security is also an important feature to consider in real-time processing. Spark Streaming, through its core Spark API, provides shared secret authentication, where communication happens after a handshake between the systems. It also supports Kerberos authentication, but only on a YARN cluster. Storm release 0.9.6 does not provide authentication, and security is handled by external firewalls; from release 0.10.0 onwards, it supports Kerberos authentication. NiFi comes with different alternatives for security. It provides certificate authentication and username/password authentication. It also provides pluggable authorization, offering Roles for different users, and it allows encryption/decryption of the flow files. So if the use case needs more security even within the flow files, for example in a bank environment where transactions have to be highly secured, the contents flowing may need to be encrypted for some users based on Roles and decryptable for others. In such cases NiFi is a good tool to choose.

• Extensibility : Extensibility can be described in terms of the addition of new features/components or functionality, as well as modifications to existing systems. When considering the addition of new features/components, Spark Streaming builds on core Spark's extensible API, which supports modules such as SQL and MLlib. This allows the Streaming API to use one or more of these modules, which makes it extensible. Storm is designed to be extensible for using external functions such as SQL features; for that it uses its Topologies and other APIs. NiFi is also designed to be extensible: its main components, such as Processors and Reporting Tasks, are extension points. It is possible to design your own Processors capable of achieving your purposes, and also to use NiFi's existing Processors to modify and transform your data. For use cases where data from sensors in IoT applications needs to be transformed from one form to another for processing, choosing NiFi would be a good fit considering this feature. Extensibility can also mean adding new functionality to the flows you want to change or to the application code you have written. Spark Streaming and Storm follow the same approach, where the new application code first has to be tested and then deployed, either in parallel with the existing one, or by first shutting down the existing application and starting the new one. This approach does not produce good results when real-time decision making is considered, and it does not allow tracing bugs as they occur in real-time; so there is some downtime when something in the application code needs to be changed. In NiFi, adding new functionality means adding/removing Processors or other components to/from the flow. Since it does not have save-and-deploy steps, the effect of what has been added or removed is seen in real-time in the UI. And if problems occur while making that change, they are traceable and can be solved right away. So NiFi is good for use cases where decision making, viewing the flow, and tracing bugs in real-time are vital.

4.4 How each tool handles the use case

In this section, a use case that is going to be used as a benchmark for the practical analysis is defined, and how each tool handles the use case is depicted briefly in theory. This theoretical study and comparison of the tools is important for choosing the tool that will be used for the analysis. There are two data sources from which data is ingested into the system. One is a real-time source, the Twitter API, for receiving tweets; the other is static data, retrieving historic tweets from the NoSQL database HBase. The two data sources are merged and filtered; then the incorrect data is sent to log files on local systems and the correct data to HBase. Finally, some of the filtered data is indexed into Apache Solr.

Figure 4.1: General use case flow

(I) Apache NiFi : NiFi handles this use case in a simple and efficient way through a friendly UI. It comes with processors that handle interactions between NiFi instances and other sources and systems, which for this use case are Twitter, HBase, and Solr. It does this by providing inbuilt processors to get the data, process it (extract, filter), and route it to other processors and downstream systems. It also has processors to write incorrect files to logs in local file systems or HDFS. Each processor used must be configured correctly to function appropriately, which enables monitoring and controlling the flow in real-time. It is also possible to use output ports, which allow sending flow files to external systems such as Apache Spark for further processing if needed.

Figure 4.1 above shows the general flow of the system for the use case. By setting properties for each processor and using the NiFi Expression Language, it is possible to design and query simple ETL transformations, splitting, merging, filtering, and extraction of needed information, which can further be used as input to other systems. NiFi uses the “GetTwitter” processor to receive the tweets from the Twitter API. Specific search terms can be set in the “Terms to Filter On” property of the

Figure 4.2: NiFi use case flow

processor. The “GetHBase” processor is used to retrieve historic tweets from HBase. These two processors send the flow files to a downstream processor which extracts the required fields from the JSON files they produce. For this, the “EvaluateJSONPath” processor is used, which allows defining custom properties that can later be used in making routing decisions. After the data is merged and the required fields are extracted, the “RouteOnAttribute” processor is used, which allows defining custom Boolean rules. These rules define whether the flow is correct or incorrect and are important in the next parts to route the flow accordingly. According to the rules, if the flow is correct, it is sent to HBase for storage, which is handled by the “PutHBaseJSON” processor, and some of the flow is also sent to Solr for indexing with the “PutSolrContentStream” processor. The incorrect data is sent to the “PutFile” processor, used as a log. This use case is shown using NiFi in Figure 4.2.

(II) Spark Streaming : Apache Spark Streaming is based on DStreams, which are small batches of RDDs over a specified time. Figure 4.1 is used as the general use case diagram. Initially, a Spark Context object is created from the configuration object. Then an SQL Context object is created, taking the Spark Context object as an argument; it is responsible for getting the queries from HBase and storing them in a temporary table, “tmpHBase”. Then a Streaming Context is created, taking the Spark Context and the sliding window interval as arguments. Spark Streaming receives inputs from the Twitter API using the Streaming Context, creating a DStream, “twitterDStream”. Then a Window operation is defined, which takes the sliding interval and the window length. A series of DStream transformations follows: splitting the tweets into separate words (“splitDStream”), filtering, which again creates other DStreams (“filterDStreams”), and mapping of the data, each time creating new DStreams with new transformations. After this, the last transformed DStream is stored as a temporary table, “tmpTwitter”, where it is joined/merged with the previous temporary table, “tmpHBase”, using the SQL Context created before, and it is continuously stored using the foreachRDD method. Finally, the “saveAsHadoopFile” or “saveAsNewAPIHadoopDataset” method is used to

store the data. The general illustration is shown in Figure 4.3.

Figure 4.3: Spark Streaming use case flow
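A minimal Scala sketch of this flow follows, assuming the spark-streaming-twitter module is on the classpath; the DStream names mirror the text, while the HBase loading and the final Hadoop output configuration are abbreviated as comments.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("TwitterUseCase")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)       // used for the tmpHBase/tmpTwitter join
val ssc = new StreamingContext(sc, Seconds(10))

// twitterDStream: live tweets from the Twitter API
val twitterDStream = TwitterUtils.createStream(ssc, None)

// splitDStream / filterDStreams: split tweets into words, drop empties
val filterDStreams = twitterDStream
  .map(_.getText)
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// Window operation over the stream
val windowed = filterDStreams.window(Seconds(60), Seconds(10))

windowed.foreachRDD { rdd =>
  // register the batch as tmpTwitter, join it with tmpHBase (loaded from
  // HBase via sqlContext beforehand), then persist the joined result with
  // saveAsHadoopFile / saveAsNewAPIHadoopDataset (configuration omitted)
}

ssc.start()
ssc.awaitTermination()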

(III) Apache Storm : Figure 4.1 is again used as the general use case diagram, and here it is described how Storm handles this use case. It starts by creating a Topology, where the Spouts and Bolts used are initialized. Then the Spouts are created, which are the entries where data is ingested. Two Spouts are created for receiving data from the two different sources, because the sources are different and the data arrives in different forms: data coming from the Twitter API and historic data from HBase, hence “twitterSpout” and “hbaseSpout”. In order to load data from HBase, an HBase connection first has to be created for use in Storm. The data is then sent to the Bolt that is subscribed to these Spouts, where merging of the data is handled, “mergeBolt”. Another Bolt, subscribed to the first Bolt, is created for processing the data (filtering and checking whether the text, language, and location fields are empty or not), “processBolt”. If one of the fields is empty, the data is considered incorrect and is sent to the Bolt that writes it to a log file in the local file system, “incorrectBolt”. If the fields are not empty, it is sent to the storage systems via “persistBolt”. This use case is illustrated using Storm in Figure 4.4.

Figure 4.4: Storm use case flow
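A hedged sketch of wiring this Topology with Storm's Java API (0.9.x packages) from Scala; the spout and bolt classes, and the "correct"/"incorrect" stream ids, are hypothetical stand-ins for the implementations described above.

import backtype.storm.{Config, StormSubmitter}
import backtype.storm.topology.TopologyBuilder

val builder = new TopologyBuilder

builder.setSpout("twitterSpout", new TwitterSpout)  // hypothetical spout class
builder.setSpout("hbaseSpout", new HBaseSpout)      // hypothetical spout class

// mergeBolt subscribes to both spouts
builder.setBolt("mergeBolt", new MergeBolt)
  .shuffleGrouping("twitterSpout")
  .shuffleGrouping("hbaseSpout")

// processBolt filters and checks the text, language, and location fields
builder.setBolt("processBolt", new ProcessBolt)
  .shuffleGrouping("mergeBolt")

// route to a logging bolt or a persisting bolt via separate output streams
builder.setBolt("incorrectBolt", new IncorrectBolt)
  .shuffleGrouping("processBolt", "incorrect")
builder.setBolt("persistBolt", new PersistBolt)
  .shuffleGrouping("processBolt", "correct")

StormSubmitter.submitTopology("twitter-topology", new Config, builder.createTopology())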

4.5 Summary

To summarize, in this chapter a thorough review of the tools was done based on some selected parameters, and their differences and similarities were then shown in a table summarizing how each of the tools handles those parameters. A discussion of the parameters followed, showing which tool is good for a particular use case. In the last part, how each tool handles the use case was briefly discussed. All of this is important to see the advantages and disadvantages of using one tool over another given parameters such as ease of use and integration with external systems such as HBase and Solr, which are used in this case. It helps in making a decision on the tool to use in the practical analysis, grounded in the study and comparison of the tools. Considering the use case defined, which involves ingestion of data, processing (merging, extraction, filtering, and routing) of data, and finally persisting it to storage systems with further analysis after indexing to Solr, the use case is not so complex: it does not require machine learning techniques or heavy computations and aggregations. In this regard, all of these tools are suitable for the use case defined. But NiFi has advantages over the others in that it provides an advanced web UI which enables designing, monitoring, and controlling the flow in real-time as the tweets keep flowing into the system. It also provides the provenance of the data, i.e. the origin of the data and what happens to each piece of data in real-time. It also has an advantage in integration with other systems such as HBase, Solr, Spark, and Storm. This is important if the data processed in NiFi needs to be transferred to other systems such as Spark for further advanced analysis; NiFi uses an inbuilt mechanism called the “Site-to-Site” protocol for this purpose. Generally, NiFi is the kind of tool to use for such use cases, where it enables easy, reliable, fast, and efficient processing of data which can then be transferred to other systems. So the tool that is going to be used for the analysis is NiFi, and further analysis is also made in Solr.

Chapter 5

Practical analysis/Twitter Data Analysis

This chapter covers the practical analysis of the project. In the previous chapters, different tools popular for processing real-time data were studied. In this chapter, a practical analysis of Twitter data is made using Apache NiFi, one of the tools studied in the previous chapters. The next sections introduce the problem definition and formulation, the questions it tries to answer, the setup for the analysis, and the main analysis parts.

5.1 Problem definition

The use case chosen for this purpose is the analysis of Twitter data in real-time. This use case is chosen to demonstrate how NiFi communicates with other data sources and systems such as Twitter, HBase, and Solr through its inbuilt processors. The use case is summarized as:

1. Source A - real-time tweets from Twitter

2. Source B - static data from HBase

3. Combine sources and extract required fields

4. Filter out incorrect data to log file - define a rule to filter correct/incorrect data

5. Write all data to an HBase table

6. Write some of the transactions to a Solr index - based on the rule defined

This use case tries to answer questions such as: how NiFi can be used with other systems; how the rules are formulated to extract, filter, and route flow files to their respective downstream connections; the top languages; the locations of the tweeters; the top tweeters; and so on. Apache NiFi is the main tool used for processing and analysis of the Twitter data, from ingesting the tweets to extracting useful fields, filtering, and making routing decisions based on defined properties. After the analysis, the data is persisted to HBase and

some tweets are also indexed using Apache Solr. Finally, Banana, a visualization tool working with Solr, is used to visualize the analysis in real-time.

5.2 Setup

The analysis is made on a Windows 10 machine with 8GB of memory, running Apache NiFi 0.6.0 in local mode, Apache Solr 5.5.0 in standard mode locally, and Banana 1.6 for visualizations. The project is also deployed on an Amazon Web Services (AWS) cluster on a CentOS machine running HDP 2.4 and HDF 1.2. HDP 2.4 is the Hortonworks Data Platform, consisting of the major Hadoop components, whereas HDF 1.2 is the Hortonworks Data Flow platform powered by Apache NiFi. HBase is part of HDP 2.4, and Apache Solr is installed separately in standard mode.

5.3 Analysis

In this section, the steps followed for the analysis of the tweet data are given. As in any other data processing framework, the process starts with data ingestion into the NiFi system. Then data processing is done, and finally the data is persisted to a storage database. Figure 5.1 shows this graphically.

Figure 5.1: Data Analysis flow

The data is ingested from two sources, i.e. real-time and static sources. The real-time data comes from Twitter, while the static source is historic tweets that were stored initially. Once the data is ingested into the flow, processing of the data continues. This includes extracting the required fields, since the tweets come with a whole lot of fields but only some of them are of interest. Then filtering and defining custom properties is done for making routing decisions. After all these steps, the data is persisted to storage systems and some of the data is also sent for indexing. The static data from the storage system is also fed back as an input source to the Data Ingestion step.

• Prerequisite : In order to use the needed processors, some prerequisites should be properly configured. For example, for using the GetHBase and PutHBaseJSON processors, the HBase_1_1_2_ClientService has to be configured in advance. The prerequisites are found in the Controller Services. A Controller Service is one of the important functions that NiFi provides. It bundles a whole lot of services to be configured and used repeatedly. Once the services are configured and set, NiFi allows using them repeatedly for many clients in the same instance without further configuration. It has many services, such as the DBCP Connection Pool, which once set can be used for many database connections. It also has the HBase_1_1_2_ClientService, where configuration files are specified; once set, it can be used by many clients running HBase for reading and writing data. In this project, the HBase_1_1_2_ClientService is used, and the path for the “hbase-site.xml” and “core-site.xml” files is specified for it to work properly. After this, the “GetHBase” and “PutHBaseJSON” processors can be used repeatedly in this instance because the common configuration is handled by the Controller Services.

5.3.1 Data Ingestion

The tweets are fetched from the Twitter API and then loaded into the NiFi flow through the “GetTwitter” processor. This processor has mandatory configuration properties that need to be set before starting the flow. The properties are shown in Table 5.1.

Property | Description
Twitter Endpoint | Specifies the Sample Endpoint or the Filter Endpoint. The Filter Endpoint has to be specified if terms to search for are given; otherwise the Sample Endpoint is used to get all public tweets.
Consumer Key and Consumer Secret | Provided by the Twitter API when creating the application
Access Token and Access Token Secret | Provided by the Twitter API when creating the application

Table 5.1: Mandatory properties for “GetTwitter” Processor

It also has other properties, such as “Terms to Filter On”, where terms to filter can be specified. For this project it was decided to filter based on the terms “IoT, InternetofThings, and BigData”. Once those properties are properly set, this processor is ready to start fetching data from the API. The other input data is from the NoSQL database HBase, where historic data is initially stored. For ingesting this static data from HBase, the “GetHBase” processor is used, an inbuilt processor for reading historic data from HBase. It uses the HBase_1_1_2_ClientService, which is set once and used many times by different HBase clients. The other mandatory property specifies the table name; in this project, the table name is “Twitter”.

5.3.2 Data Processing

This step has different parts, starting from extracting the required fields to filtering, separating the correct data from the incorrect based on defined rules, and finally routing the data based on the rules set.

• Extraction of Required Fields : As Twitter data is unstructured, consisting of a variety of types with many fields, it is not interesting to keep all of these fields for analysis. This step extracts only the required fields for further analysis. The Twitter data comes in JSON format, and NiFi provides an inbuilt processor called “EvaluateJSONPath” which is used to extract the required fields from the JSON data by allowing custom properties to be defined. The property names defined here are used when making routing decisions or in other processors. The fields of interest used for extracting the tweets are shown in Table 5.2.

Property Name | Twitter JSON field
twitter.id | $.id
twitter.user | $.user.name
twitter.handle | $.user.screen_name
twitter.createdAt | $.created_at
twitter.text | $.text
twitter.timestamp | $.timestamp_ms
twitter.hashtags | $.entities.hashtags[0].text - gets only the first hashtag
twitter.mentions | $.entities.user_mentions[0].name - gets only the first mention
twitter.lang | $.lang
twitter.location | $.user.location

Table 5.2: Custom properties for extracting tweets
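To illustrate what these JSONPath expressions select, here is a small stand-alone sketch using the Jayway JsonPath library, which uses the same path syntax; the tweet JSON is abbreviated and illustrative.

import com.jayway.jsonpath.JsonPath

val tweetJson = """{"id":1,"user":{"name":"Ada","screen_name":"ada","location":"Uppsala"},"text":"Hello #IoT","lang":"en","entities":{"hashtags":[{"text":"IoT"}]}}"""

val text: String = JsonPath.read(tweetJson, "$.text")                          // "Hello #IoT"
val handle: String = JsonPath.read(tweetJson, "$.user.screen_name")            // "ada"
val firstTag: String = JsonPath.read(tweetJson, "$.entities.hashtags[0].text") // "IoT"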

• Filtering and Routing of Data : All the tweets coming in at this step are filtered by the terms set in the previous steps: “IoT, InternetofThings, BigData”. Even though they come with the right terms, it is good to note that quite a few of the fields are empty, which makes no sense to consider. So further filtering is needed to remove those empty fields in order to make better routing decisions. Filtering is done in the context of Correct/Incorrect data in the “RouteOnAttribute” processor provided by NiFi. This processor allows defining custom rules, based on which routing decisions are made. The custom rules that decide whether data is Correct or Incorrect are:

(I) The Text, Hashtags, Mentions, Language, and Location fields extracted must not be empty.

(II) The tweets are routed to different downstream systems based on the rule set: English and Non-English tweets.

The rules use the fields extracted in the previous steps in the “EvaluateJSONPath” processor and are given names which are used as Connections for routing the data to different downstream systems. There is also another inbuilt rule which routes data when the custom rules are not satisfied. The rules are: route non-empty tweets that are English or Non-English; empty tweets that do not satisfy the custom rules are sent to the “Unmatched” relationship. The first rule in the NiFi Expression Language is:

Rule 1: ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.hashtags:isEmpty():not()}):and(${twitter.lang:isEmpty():not()}):and(${twitter.mentions:isEmpty():not()})}

Rule 2:
English - ${twitter.lang:equals("en")}
Non-English - ${twitter.lang:equals("en"):not()}

When combining the above rules, this gives:
English: Rule 1 + Rule 2 = ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en")})}

Non-English: Rule 1 + Rule 2 = ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en"):not()})}

“Unmatched” - one or more of the fields is empty.

(A) Correct Data: Correct data in this context is data that satisfies the rules set above, i.e. either English or Non-English tweets that are not empty. Here the correct data is routed based on its type to different destinations. If a tweet is Non-English, it is routed to Solr for further analysis, to show the different languages in which tweets were made, the locations, the top tweeters, and so on. All tweets that are correct (i.e. both non-empty English and Non-English) are persisted to HBase.

(B) Incorrect Data: Incorrect data has one or more of its fields empty. It is sent to log files through the “Unmatched” relationship. The “LogAttribute” processor is used here to simply log those flow files that do not satisfy the custom rules.

5.3.3 Data Storage

Twitter produces unstructured data in various formats, such as videos, images, plain text, and media files, and the data is produced at high speed in large volumes. Once the data is processed, it has to be persisted to storage systems. Traditional Relational Database Management Systems (RDBMS) cannot handle such data because of the volume and variety of data produced by such social media. So NoSQL databases

have to be used, which are capable of handling large volumes of unstructured data in an efficient way. In this regard, the NoSQL database HBase is chosen for this project. After the data is processed in the previous steps, the correct and incorrect data are differentiated according to the rules set and routed accordingly.

• Correct data : All processed correct data is persisted in Apache HBase. In order to store this data in Apache HBase, NiFi provides an inbuilt processor called “PutHBaseJSON” which writes data to the HBase database in JSON format. It has mandatory fields to be set before it is ready for use. The first property that needs to be set is the HBase_1_1_2_ClientService property, which is discussed in the section above. For this project, the table name and column family are “Twitter” and “tweets” respectively, created with the command:

hbase(main):002:0> create 'Twitter', 'tweets'
0 row(s) in 1.3130 seconds
=> Hbase::Table - Twitter

After the table is created, the other mandatory properties for the specific HBase client are specified in NiFi: “Twitter” as the table name, “tweets” as the column family, and “Id” as the Row Identifier Field Name, which comes from the JSON tweet Id field. After these properties are specified in the processor, it is ready to start.

5.3.4 Data Indexing & Visualization

In this step, some of the processed data is sent to Apache Solr for indexing and searching. The rules/properties that were set earlier are used here to send the data to Solr. In this regard, all the Non-English tweets that were processed are sent to Solr for indexing. There is no compelling reason to choose only these tweets; they are selected as an example to show how NiFi can be integrated with Apache Solr, and also the different language and location distributions, top tweets, and so on in Solr. For achieving this, the “PutSolrContentStream” processor is used. This processor has mandatory properties that need to be set, such as Solr Type and Solr Location. The Solr Type specifies either Standard mode or Cloud mode.

For this project, standard Solr mode is used. The Solr Location specifies the location where the Solr server is installed, which is “http://52.30.209.198:8983/solr/twitter”; “twitter” is the Core where all the tweets are stored and indexed. The processor also allows defining custom properties which transform the JSON document into a Solr document type, later used in Solr as attributes for further analysis. The properties defined are shown in Table 5.3.

Property | Solr field Name
f.1 | id:/id
f.2 | twitter_text_t:/text
f.3 | twitter_username_s:/user/name
f.4 | twitter_created_at_s:/created_at
f.5 | twitter_timestamp_ms_tl:/timestamp_ms
f.6 | twitter_screenname_s:/user/screen_name
f.7 | twitter_location_s:/user/location
f.8 | twitter_lang_t:/lang
f.9 | twitter_tag_ss:/entities/hashtags/text
f.10 | twitter_mentions_ss:/entities/user_mentions/name
f.11 | twitter_source_s:/source

Table 5.3: Custom properties for indexing tweets

The data that is processed and indexed this way has to be presented to the user visually to help make better decisions. It is also important to see which parameters are the most important to watch, so that the user is aware of what is happening. In this project, the indexed data is further fed to a visualization tool called “Banana” to showcase the different properties of the tweets in a dashboard in real-time. In the search field, terms such as IoT are searched, and hits for specific search terms, top tweeters, and the languages and locations are visualized using bars, histograms, and other components. After all the processors are properly configured and ready to start, the NiFi UI looks like Figure 5.2.

Figure 5.2: Overall NiFi Twitter Data Flow

5.3.5 Data Result & Discussion

After the flow is allowed to run, streams of tweets start flowing in real-time. Figure 5.3a shows an example of Non-English tweets that were extracted, with the text, language, and location fields all non-empty according to the rule, and routed accordingly to their respective downstream connections.

(a) Non English Tweets from Provenance data

(b) English Tweets from Provenance data

Figure 5.3: Both English and Non English Tweets from Provenance data

The attribute names are the fields defined to extract the tweets from the Twitter API in the “EvaluateJSONPath” processor. The figure shows the date each tweet was created, the languages, locations, the text, and also the usernames and screen names, which are shaded for privacy purposes. The “RouteOnAttribute.Route” field shows that it has a rule, “tweetsNonEnglish”, defined to route tweets that are not in English to downstream connections. Figure 5.3b shows the same information for a tweet whose “RouteOnAttribute.Route” field shows the rule “tweetsEnglish”, which routes English tweets to the next processor.

• Statistics : NiFi also allows viewing statistics for each processor flow along different parameters. The parameters include Average Task Duration, Bytes

Read in the last 5 minutes, Bytes Written in the last 5 minutes, Flow Files Out in the last 5 minutes, and so on. Figure 5.4a below shows the Average Task Duration for the “PutSolrContentStream” processor.

(a) Average Task Duration status for Indexing

(b) Status for flow files out from the processor

Figure 5.4: Statistics data

The left side of Figure 5.4a shows the name and type of the processor as well as the start and end time over which the average task duration is shown. The last item shows the Min/Max/Mean time it takes to process the flows or send them to Apache Solr. From the graph, the peak average task duration [00:00:00.076] is between 21:10 and 21:15. The next highest, a little less than [00:00:00.040], is around 22:30. Figure 5.4b shows the statistics for the “GetTwitter” processor, which is responsible for getting the tweets from the Twitter API. It receives the tweets and sends them out to the next processor or downstream connection for further processing. This statistic shows the Flow Files transferred out in the last 5 minutes. The left side shows the start and end time and also the Min/Max/Mean

of the number of files transferred. It is shown that the maximum number of flow files transferred is 12, and the peak time is around 11:25.

5.3.6 Data Analysis in Solr

Further analysis is also made in Apache Solr, and its results are visualized using Banana. The analysis searches for specific terms in the tweets and returns the hits for each term, the languages, and the locations of the tweeters.

• Specific terms in a tweet : This analysis defines the terms to search for in Solr and returns the number of hits for terms such as “iot, bigdata, internetofthings”, only in Non-English tweets, which were used as filters in NiFi in the previous sections. The total number of indexed tweets is 789, and from these the searches for the individual terms internetofthings, bigdata, and iot return 7, 25, and 30 hits respectively. The search for all the terms together returns 52.

The query used is:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.5: Filter for specific terms, “iot,bigdata,internetofthings”

Figure 5.5 also shows that the filtering is done with the query “-twitter_lang_t:en”, which loads only Non-English tweets.

• The language distribution : This analysis gives the different languages in which the tweets were made. When the terms iot, internetofthings, and bigdata are searched for specific languages, it returns 24, 10, 4, and 2 hits for French, Spanish, German, and Japanese respectively.

Figure 5.6: Language distribution

The query used for non-English languages is:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.6 shows the language distribution in a pie chart, where 52% is French, 22% Spanish, and so on. It also shows a world map where the parts of the world where tweets were made are shaded. Next to it, the top tweeters for the specific search terms used above are shown.

• The locations of the tweeters : This part of the analysis concerns the locations of the tweeters, to show which parts of the globe are tweeting about certain terms such as iot, bigdata, and internetofthings. The locations may sometimes be at country level, such as Sweden or Canada, or they may be particular locations without a country level.

Figure 5.7: Location distribution

Figure 5.7 above shows the different locations and their occurrences when the above terms are searched from the Banana User Interface. It also shows the hashtags in a TagCloud panel, and next to it the top mentions for the searched terms are shown in a pie chart. This visualization helps the user search for specific terms across all the indexed tweets and draw conclusions from the different properties/characteristics displayed in the dashboard.

Chapter 6

Evaluation

This chapter discusses the performance evaluation and how the designed data flow can be optimized. NiFi's performance is affected by many factors, such as the type and number of processors used, which affects the system resources (CPU, RAM, ...); whether the flow is allowed to run without constraints, producing large backlogs (i.e. whether a back pressure mechanism is applied or not); and whether clustering is used or not.

• The type and number of processors used : The type of processors used determines how many resources are allocated to a particular processor, and this differs from processor to processor, because some processors are resource-intensive and require more by default. The number of processors used also has an impact on performance, because every processor needs a thread allocated to it by the Flow Controller to function properly. In this regard, grouping processors with the same functionality helps to minimize the threads that are used and makes them available for other processes.

Figure 6.1: Same Processors used repeatedly

Figure 6.1 shows the same processors being used repeatedly to extract and route data once it is ingested from the “GetTwitter” and “GetHBase” processors. This works against the performance of NiFi because it uses more threads for the same processors. Grouping processors with the same functionality together is therefore a good design choice and enhances performance. Figure 6.1 can thus be condensed by grouping the same processors together so that they use only single threads for their execution, as shown in Figure 6.2.

Figure 6.2: Same processors used once for performance gain

Figure 6.2 above shows that the data from the “GetTwitter” and “GetHBase” processors is all sent to one processor, i.e. “Extract Fields - EvaluateJSONPath”, and likewise downstream.

• Back pressure mechanism : If the flow files are allowed to run continuously without any constraint, this has a big impact on the performance of NiFi and the system as a whole. The impact could be due to repositories not being updated properly, or a full disk because of the amount of data flowing in. To solve this problem, NiFi provides back pressure mechanisms to be deployed on connections to downstream systems, by setting a threshold on the number of flow files to be processed or on the size of the files, so that data is allowed to flow until this threshold is reached, as shown in Figure 6.3.

Figure 6.3: Setting back pressure for the connection

The downstream processor then only accepts flows up to the threshold, and once the backlog is below the threshold it processes the queued data. This prevents NiFi and the processor itself from being overwhelmed. The back pressure can be set on every connection created between the processors, or by using another processor called “ControlRate” only once; all downstream processors then get the flows at the rate set in this processor. In the Twitter analysis, it is possible either to set the back pressure on every connection or to use the “ControlRate” processor. Figure 6.4 shows the “ControlRate” processor being used to set the threshold to 100KB. This means that it will process data up to 100KB; if more flow files come in, they are queued, and when the backlog drops below the threshold, the queued data is processed.

Figure 6.4: Using the “ControlRate” processor to control the rate of flow to downstream processors
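Reduced to its configuration, the “ControlRate” setup behind Figure 6.4 is a handful of processor properties. The sketch below uses the property names as they appear in the ControlRate processor’s documentation (worth verifying against the NiFi version in use); the 100 KB value mirrors this flow, while the one-minute window is an assumed example:

    Rate Control Criteria : data rate   (throttle on data size, not flow file count)
    Maximum Rate          : 100 KB      (at most this much data passes per time window)
    Time Duration         : 1 min       (assumed window over which the rate is enforced)

Setting “Rate Control Criteria” to “flowfile count” instead would throttle on the number of flow files rather than their size.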

• Clustering: If more data flows into the system than the available system resources can handle, adding resources through clustering can be used for performance gain. NiFi works in a master/slave architecture where the master checks the load of every node (slave) in order to assign work; after calculating the load balance across the nodes, it assigns work to the respective node. Nodes with more resources can be added as needed to distribute the work across the cluster, which in turn results in a performance gain. Clustering is thus also one way of addressing performance problems.
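In the NiFi version used in this project, a cluster is enabled through nifi.properties on the master (the NiFi Cluster Manager, NCM) and on each node. A minimal sketch, assuming the property names from the NiFi 0.x administration guide; host names and ports are placeholders:

    # on the NiFi Cluster Manager (the master)
    nifi.cluster.is.manager=true
    nifi.cluster.manager.address=ncm-host
    nifi.cluster.manager.protocol.port=9001

    # on every node (the slaves)
    nifi.cluster.is.node=true
    nifi.cluster.node.address=node-host
    nifi.cluster.node.protocol.port=9002
    nifi.cluster.node.unicast.manager.address=ncm-host
    nifi.cluster.node.unicast.manager.protocol.port=9001

Every node runs the same flow on its own share of the data, so adding nodes spreads the load.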

Chapter 7

Conclusion and Future Work

This thesis project investigates the handling of streaming data. It is divided into two parts, theoretical and practical. In the theoretical part, IoT is studied, together with an overview of tools such as Apache NiFi, Apache Spark Streaming, and Apache Storm. The project further defines the parameters Ease of Use, Security, Reliability, Queued data/Buffering, and Extensibility to review the behavior of the studied tools. This approach makes the results usable as a guide for choosing one of the tools in the future.

From the study of the tools, it is found that Apache NiFi is a data processing tool with features such as a user-friendly web UI, built-in security, fault tolerance, provenance and lineage, extensibility, clustering, and more. It is highly suitable for IoT applications because its extensibility allows designing custom processors capable of ingesting data into NiFi in the required formats. It is also found that Spark Streaming is a fast processing framework, because it uses in-memory computation and divides the incoming data into small batches, which reduces latency and speeds up computation, while Apache Storm processes the data without breaking it into chunks, computing tuples as they arrive. Both Spark Streaming and Apache Storm can be used for simple data processing such as ETL operations as well as for more complex computations requiring MLlib algorithms, heavy computation, and aggregations, whereas NiFi is used for simple data processing such as ETL, routing, data mediation, and similar operations.

Finally, in the practical part, Apache NiFi is used to process Twitter data and examine tweets matching certain terms such as “iot, bigdata, internetofthings”. Further analysis is made of the number of hits for these terms and the location and language distribution of the tweets in Apache Solr, with the results visualized in the Banana framework. This shows that the platform/tool chosen for the practical analysis, i.e., Apache NiFi, is suitable for such use cases and can be used efficiently for data processing and analysis. It also shows that NiFi can easily be integrated with external systems such as Apache HBase, Apache Solr, and others.
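To make the extensibility point concrete: a custom NiFi processor is a Java class built on NiFi’s processor API. The minimal sketch below, which tags incoming IoT flow files with a source attribute, is a hypothetical example rather than a processor used in this project; the API types (AbstractProcessor, ProcessSession, Relationship) are NiFi’s, while the class name, attribute key, and value are invented for illustration:

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.annotation.documentation.CapabilityDescription;
    import org.apache.nifi.annotation.documentation.Tags;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    @Tags({"iot", "example"})
    @CapabilityDescription("Hypothetical processor that tags incoming IoT flow files with a source attribute.")
    public class TagIotSourceProcessor extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Flow files tagged successfully")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            // Take one flow file from the incoming queue; yield if none is waiting.
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Attach a custom attribute (hypothetical key/value) and route downstream.
            flowFile = session.putAttribute(flowFile, "iot.source", "sensor-gateway");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Packaged as a NAR archive and placed in NiFi’s lib directory, such a processor appears on the canvas next to the built-in ones.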

7.1 Future Work

In the theoretical part of the thesis, the work done can be extended in various ways. It can be extended to include an overview of more frameworks in the stream data processing area. The study on IoT can be further extended to cover the challenges and solutions in more detail. More parameters can be added to compare and contrast the tools. And since the project focuses mainly on stream data processing and analysis, which the studied tools comply with, it could be extended to include an overview of storage systems and search or indexing platforms.

The practical part, i.e., the Twitter data analysis, could also be extended with more features. Only some of the tweet fields were extracted and analyzed here, but the analysis can be extended to include further fields. Different routing rules than the ones used can be set, and more of them. One interesting extension point would be to send the processed data from NiFi to external systems such as Apache Spark for more complex computations on the tweets. And since the Twitter analysis was done as a benchmark to show how NiFi can be used for such cases, this work can be further extended to process data from other types of sources, such as geospatial, sensor, or other IoT data.

List of Figures

3.1 NiFi standalone Architecture - source [8] ...... 13
3.2 NiFi Cluster Architecture - source [8] ...... 14
3.3 NiFi UI canvas ...... 17
3.4 NiFi main components ...... 18
3.5 NiFi Processor Anatomy ...... 19
3.6 NiFi Provenance ...... 21
3.7 NiFi Lineage ...... 21
3.8 Continuous RDDs form DStream - source [33] ...... 23
3.9 Spark Cluster - source [33] ...... 24
3.10 Storm Topology - source [38] ...... 26
3.11 Storm Cluster - source [38] ...... 27

4.1 General use case flow ...... 35
4.2 NiFi use case flow ...... 36
4.3 Spark Streaming use case flow ...... 37
4.4 Storm use case flow ...... 38

5.1 Data Analysis flow ...... 40
5.2 Overall NiFi Twitter Data Flow ...... 45
5.3 Both English and Non-English Tweets from Provenance data ...... 46
5.4 Statistics data ...... 47
5.5 Filter for specific terms, “iot, bigdata, internetofthings” ...... 48
5.6 Language distribution ...... 49
5.7 Location distribution ...... 49

6.1 Same Processors used repeatedly ...... 50
6.2 Same processors used once for performance gain ...... 51
6.3 Setting back pressure for the connection ...... 52
6.4 Using the “ControlRate” processor to control the rate of flow to downstream processors ...... 52

List of Tables

3.1 Storm Architecture Components Functionality ...... 27

4.1 Differences and Similarity of the tools ...... 32

5.1 Mandatory properties for “GetTwitter” Processor ...... 41 5.2 Custom properties for extracting tweets ...... 42 5.3 Custom properties for indexing tweets ...... 45

Acronyms & Abbreviations

The acronyms used in this report are outlined in the table below.

Acronym    Description
ASF        Apache Software Foundation
API        Application Program Interface
CPU        Central Processing Unit
CSV        Comma Separated Value
DStreams   Discretized Streams
DDoS       Distributed Denial of Service
ETL        Extract Transform Load
FTP        File Transfer Protocol
HDFS       Hadoop Distributed File System
HVAC       Heating Ventilation Air Conditioning
HDF        Hortonworks Data Flow
HDP        Hortonworks Data Platform
H2H        Human-to-Human
H2T        Human-to-Things
HTML       Hyper Text Markup Language
ICT        Information Communication Technology
ITU        International Telecommunication Union
IoT        Internet of Things
JSON       JavaScript Object Notation
JVM        Java Virtual Machine
MIT        Massachusetts Institute of Technology
NSA        National Security Agency
NCM        NiFi Cluster Manager
OS         Operating System
QoS        Quality of Service
RFID       Radio Frequency Identification
RPG        Remote Process Group
RDD        Resilient Distributed Dataset
SSL        Secure Sockets Layer
S3         Simple Storage Service
T2T        Things-to-Things
TLP        Top Level Project
TCP/IP     Transmission Control Protocol/Internet Protocol
URL        Uniform Resource Locator
UI         User Interface
WSN        Wireless Sensor Network
WAL        Write Ahead Logging
XML        Extensible Markup Language

Bibliography

[1] “Ericsson IoT”. url: http://www.ericsson.com/thecompany/our_publications/books/internet-of-things (visited on 02/11/2016).
[2] D. Miorandi et al. “Internet of things: vision, applications and research challenges”. In: Ad Hoc Networks vol. 10, no. 7 (2012), pp. 1497–1516.
[3] Dave Evans. “The Internet of Things: How the Next Evolution of the Internet Is Changing Everything (white paper)”. Tech. rep. April 2011.
[4] James Manyika et al. “The Internet of Things: Mapping the Value Beyond the Hype”. In: McKinsey Global Institute (June 2015), p. 3.
[5] David Niewolny. “How the Internet of Things Is Revolutionizing Healthcare (white paper)”. Tech. rep. October 2013.
[6] R. Weber. “Internet of Things: New security and privacy challenges”. In: Computer Law and Security Review vol. 26, no. 1 (2010), pp. 23–30.
[7] M. Zaharia et al. “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters”. In: (2012).
[8] “NiFi Overview”. url: https://nifi.apache.org/docs.html (visited on 02/02/2016).
[9] “Series Y: Global Information Infrastructure, Internet Protocol Aspects and Next-Generation Networks, Next Generation Networks – Frameworks and functional architecture models (white paper)”. Tech. rep. 2012.
[10] K. Ashton. “That “Internet of Things” Thing”. In: RFID Journal (2009).
[11] Somayya Madakam, R. Ramaswamy, and Siddharth Tripathi. “Internet of Things (IoT): A Literature Review”. In: Journal of Computer and Communications vol. 3 (2015), pp. 164–173.
[12] Jayavardhana Gubbi et al. “Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions”. In: ().
[13] Sean Dieter et al. “Towards Implementation of IoT for environmental condition monitoring in homes”. In: IEEE Sensors Journal vol. 13, no. 10 (Oct 2013).
[14] “Apple HomeKit”. url: http://www.apple.com/ios/homekit/ (visited on 02/13/2016).
[15] Pedro Castillejo et al. “An Internet of Things Approach for Managing Smart Services Provided by Wearable Devices”. In: International Journal of Distributed Sensor Networks (2013).

[16] Melanie Swan. “Sensor Mania! The IoT, Wearable Computing, Objective Metrics and Quantified Self 2.0”. In: Journal of Sensor and Actuator Networks (2012).
[17] Andrea Zanella et al. “Internet of Things for smart cities”. In: IEEE Internet of Things Journal vol. 1, no. 1 (2014), pp. 22–31.
[18] Ji chun Zhao et al. “The study and application of the IoT technology in Agriculture”. In: (2010).
[19] “IoT in Agriculture Case Study, Thingworx”. url: http://www.thingworx.com/Markets/Smart-Agriculture (visited on 02/06/2016).
[20] Debasis Bandyopadhyay and Jaydip Sen. “Internet of Things - Applications and Challenges in Technology and Standardization”. In: (2011).
[21] Krushang Soner and Hardik Upadhyay. “A survey: DDoS Attack on Internet of Things”. In: International Journal of Engineering Research and Development vol. 10, no. 11 (Nov 2014), pp. 58–63.
[22] J. H. Ziegeldorf, O. Garcia Morchon, and K. Wehrle. “Privacy in the Internet of Things: threats and challenges”. In: Security and Communication Networks vol. 7, no. 12 (2014), pp. 2728–2741.
[23] Bugra Gedik and Ling Liu. “Protecting Location Privacy with Personalized K-Anonymity: Architecture and Algorithms”. In: IEEE Transactions on Mobile Computing vol. 7, no. 1 (2008).
[24] “Privacy by Design in Big Data”. In: (Dec 2015).
[25] “NSA NiFi”. url: https://www.nsa.gov/public_info/press_room/2014/nifi_announcement.html (visited on 03/12/2016).
[26] “NiFi Key Features”. url: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/high-level-overview-of-key-nifi-features.html (visited on 03/13/2016).
[27] “Apache NiFi wiki”. url: https://cwiki.apache.org/confluence/display/NIFI/Apache+NiFi (visited on 03/18/2016).
[28] “Spark Overview”. url: http://www.spark.apache.org (visited on 03/20/2016).
[29] “Spark AMPLab”. url: https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/ (visited on 03/20/2016).
[30] “Spark SQL Module”. url: http://www.spark.apache.org/sql/ (visited on 03/20/2016).
[31] “Spark GraphX Module”. url: http://www.spark.apache.org/graphx/ (visited on 03/20/2016).
[32] “Spark Machine Learning Module”. url: http://www.spark.apache.org/mllib/ (visited on 03/20/2016).
[33] “Spark Streaming Module”. url: http://www.spark.apache.org/streaming/ (visited on 03/20/2016).
[34] “Spark Programming Guide”. url: http://spark.apache.org/docs/1.6.0/programming-guide.html (visited on 04/02/2016).

[35] “Apache Storm”. url: http://storm.apache.org/index.html (visited on 04/02/2016).
[36] “Apache Storm history”. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html (visited on 04/05/2016).
[37] “Storm Feature”. url: http://hortonworks.com/hadoop/storm/ (visited on 04/02/2016).
[38] “Storm Tutorial”. url: http://storm.apache.org/releases/0.9.6/ (visited on 04/02/2016).
[39] “Spark Security”. url: http://spark.apache.org/docs/1.6.0/security.html (visited on 05/25/2016).
[40] “Storm Thrift API”. url: http://thrift.apache.org/docs/features (visited on 05/28/2016).

Appendix - Apache License, 2.0.

The material content of this thesis project is licensed under the Apache License, 2.0.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
