IT16048 Degree Project 30 credits, June 2016

Handling Data Flows of Streaming Internet of Things Data

Yonatan Kebede Serbessa

Master Programme in Computer Science

Abstract

Streaming data in various formats is generated very fast, and these data need to be processed and analyzed before they become useless. Currently existing technology provides the tools to process these data and gain more meaningful information out of them. This thesis has two parts: theoretical and practical. The theoretical part investigates which tools are suitable for stream data flow processing and analysis. In doing so, it starts by studying one of the main streaming data sources that produce large volumes of data: the Internet of Things. Here, the technologies behind it, common use cases, challenges, and solutions are studied. This is followed by an overview of the selected tools, namely Apache NiFi, Spark Streaming, and Storm, studying their key features, main components, and architecture. After the tools are studied, 5 parameters are selected to review how each tool handles these parameters. This can be useful when considering choosing a certain tool given the parameters and the use case at hand. The second part of the thesis involves Twitter data analysis, which is done using Apache NiFi, one of the tools studied. The purpose is to show how NiFi can be used for processing data, from ingestion to finally sending it to storage systems, and how it communicates with external storage, search, and indexing systems.

Supervisor: Markus Nilsson
Subject reviewer: Matteo Magnani
Examiner: Edith Ngai

Acknowledgment

It is with great honor that I express my gratitude to the Swedish Institute for awarding me the Swedish Institute Study Scholarship for my Master's studies at Uppsala University, Uppsala, Sweden. I would also like to extend my gratitude to my supervisor Markus Nilsson for giving me the chance to work on this thesis at Granditude AB and for providing important feedback on this report, and to my reviewer Matteo Magnani from Uppsala University for reviewing my work and following my progress throughout. My gratitude also goes to the whole team at Granditude for being supportive and providing a good working environment. Last but not least, I would like to thank my family and friends for their prayers and support. Thank you!

Contents

1 Introduction
  1.1 Problem Formulation and Goal
  1.2 Scope and Method
  1.3 Structure of the report
  1.4 Literature Review

2 Internet of Things Overview
  2.1 Technologies in IoT
    2.1.1 Radio Frequency Identification (RFID)
    2.1.2 Wireless Sensor Network (WSN)
    2.1.3 TCP/IP (IPv4, IPv6)
    2.1.4 Visualization Component
  2.2 Application Areas
    2.2.1 Smart Home
    2.2.2 Wearable
    2.2.3 Smart City
    2.2.4 IoT in Agriculture - Smart Farming and Animals
    2.2.5 IoT in Health/Connected Health
  2.3 Challenges and Solutions
    2.3.1 Challenges
    2.3.2 Solutions

3 Overview of Tools
  3.1 Apache NiFi History and Overview
    3.1.1 NiFi Architecture
    3.1.2 Key Features
    3.1.3 NiFi UI components
    3.1.4 NiFi Elements
  3.2 Apache Spark Streaming
    3.2.1 Key Features
    3.2.2 Basic Concepts and Main Operations
    3.2.3 Architecture
  3.3 Apache Storm
    3.3.1 Overview
    3.3.2 Basic Concepts and Architecture
    3.3.3 Architecture
    3.3.4 Features

4 Review and Comparison of the Tools
  4.1 Review
    4.1.1 Apache NiFi
    4.1.2 Spark Streaming
    4.1.3 Apache Storm
  4.2 Differences and Similarities
  4.3 Discussion of the parameters
  4.4 How each tool handles the use case
  4.5 Summary

5 Practical analysis/Twitter Data Analysis
  5.1 Problem definition
  5.2 Setup
  5.3 Analysis
    5.3.1 Data Ingestion
    5.3.2 Data Processing
    5.3.3 Data Storage
    5.3.4 Data Indexing & Visualization
    5.3.5 Data Result & Discussion
    5.3.6 Data Analysis in Solr

6 Evaluation

7 Conclusion and Future Work
  7.1 Future Work

References

Appendix

Chapter 1

Introduction

The number of devices connected to the internet is increasing each year at an alarming rate. According to Cisco, 50 billion devices are expected to be connected to the internet by 2020, and most of these connections will come from Internet of Things (IoT) devices such as wearables, smart home appliances, connected cars, and many more [1][2]. These devices produce a large volume of data at a very fast rate, and it needs to be processed in real time to gain more insight from it. There are different kinds of tools: some are designed to process only one form of data, either static or real-time, while others are designed to process both. This thesis project mainly deals with the handling/processing of real-time data flows, after a thorough study of selected stream analytics tools has been made. The thesis project is done at Granditude AB [4]. Granditude AB provides advanced data analytics and big data solutions built on open source to satisfy the needs of its customers. The company mainly uses open source frameworks and projects in the Hadoop ecosystem.

1.1 Problem Formulation and Goal

There are different types of data sources, namely real-time and static data sources. The data produced by real-time sources is fast, continuous, very large, and structured or unstructured. The data from a static source is stored historical data, which is very large and is used for enriching the real-time data. Since real-time data is produced at a fast rate, it has to be processed at the rate it is produced, before it perishes; so one problem streaming data faces is that it may not be processed fast enough. The data coming from these two sources needs to be combined, processed, and analyzed to provide meaningful information, which in turn is vital for making better decisions. This is another problem area for stream data flow processing: when the data from the two sources is not combined, due to poor integration of the different sources (static and real-time) or of data coming from different mobile devices, the result is data that is not analyzed properly and not enriched with historical data, and hence a poor result. Another problem that makes the handling or processing of streaming data difficult is the inability to adapt to changing real-time conditions, for example when errors occur. There are many tools which mainly process stream data; but studying, understanding,

and using all these platforms as they come is not scalable and is not covered in this work. This project aims to process a flow of streaming data using one tool. To achieve this, an overview of selected tools in this area is first given, and then the tool to be used in the analysis is chosen after a review and discussion of the tools based on certain parameters and a use case. This thesis project generally tries to answer questions such as:

• What tools currently exist for data extraction, processing, and analysis? This involves studying some of the selected tools in this area: their architecture, key features, and components.

• Based on the study, which tool is good for a particular use case?

• Which tool best handles both static and real-time data produced for analysis?

• Which tool enables making changes in the flow easily?

The defined use case consists of both real-time and static data to be processed and analyzed. The real-time data is tweets from the Twitter API, and the static data is tweets initially stored in the NoSQL database HBase. The two data sources need to be combined and filtered based on given properties. Based on the filtered result, incorrect data will be logged to another file, while the correct data will be stored in HBase. Finally, some of the filtered data will be indexed into Solr, an enterprise search platform. In this process, we will see what happens to each input source before and after they are combined. What techniques are used to merge and filter, and what priority levels should be given to each source, are also some of the questions answered during this stage. The basis for separating the data into correct and incorrect is also defined.

1.2 Scope and Method

The project is mainly divided into two parts: a theoretical part and a practical/analysis part. In the theoretical part, IoT is studied, as it involves many devices that produce these large amounts of data at a fast rate. In addition, the challenges it faces, the solutions that should be taken, and common existing IoT use cases are covered. Next, an overview of selected tools/platforms is given, consisting of a study of their main components, features, and common use cases. Besides this, the tools are further reviewed by defining a use case and certain parameters and seeing how each of the tools handles the parameters defined. Finally, based on the discussion result, one tool is selected for the analysis part of the project. The tools are chosen based on the requirement that they should be data processing or streaming tools and within the Hadoop framework. Based on this requirement, the tools chosen are:

• Apache Spark Streaming, Version 1.6.0

• Apache Storm, Version 0.9.6

• Apache NiFi, Version 0.6.0, HDF 1.2.0

In the practical part, a particular use case is used to showcase how the analysis is done using one of the tools studied.

1.3 Structure of the report

Here the structure of the report is briefly outlined. Chapter 2 gives an overview of the Internet of Things, comprising the technologies that make up IoT and common use cases. The challenges and solutions of IoT are also discussed briefly. Chapter 3 deals with an overview of the selected tools (Apache NiFi, Apache Spark Streaming, Apache Storm). It discusses the key features of each tool, their architecture, and the different components/elements they have. Chapter 4 is a continuation of the previous chapter; it defines certain parameters and a use case to discuss the characteristics of the tools and see how each of them behaves. Finally, based on the discussion, one tool is selected for use in the practical part. In Chapter 5, the practical phase of the project is discussed. It uses the tool chosen in the previous step to perform Twitter data analysis. Chapter 6 discusses the evaluation of the tool with respect to performance. Finally, conclusions and future work are outlined in Chapter 7.

1.4 Literature Review

Many of the papers discuss the technologies involved, common use cases, the challenges IoT is facing, and solutions for them. For example, the technology giant Ericsson is engaged in the IoT Initiative (IoT-i), with the objective of increasing the benefits and possibilities of IoT and of identifying and proposing solutions to tackle the challenges, with a team comprising both industry and academia [1]. In [2] Miorandi et al. present a survey of technologies, applications, and research challenges for the IoT. The survey also suggests RFID as the basis for the IoT technology to spread widely. In [3] a Cisco white paper defines IoT as the Internet of Objects that changes everything, considering the different aspects of our lives that it impacts, such as education, communication, business, science, and government. Different IoT application areas are also discussed in the report "Unlocking the Potential of the Internet of Things" by the McKinsey Global Institute [4], which describes the broad range of potential applications that include homes, vehicles, humans, cities, and factories as settings. In [5] the white paper discusses how IoT is being used in health care to improve access to care, increase quality, and reduce the cost of care. Some of their products include "Massimo radical-7" for clinical care and the "Sonamba Daily Monitoring" solution for early intervention/prevention, which can be used as wearable devices. Weber approaches IoT from the perspective of an Internet-based global architecture and discusses its significant impact on the privacy and security of all stakeholders involved [6]. Spark Streaming uses Discretized Streams (DStreams), defined by Zaharia et al. in [7] as a stream programming model that is capable of integrating with batch systems and provides consistent and efficient fault recovery. Since Apache NiFi is a new framework/tool for data flow management and processing, papers studying its features, programming models, and so on could not be found readily. So the study of the tool is made mostly by referring to and studying its project page [8].

Chapter 2

Internet of Things Overview

The Internet of Things (IoT), as defined by the International Telecommunication Union (ITU) [9], is a global infrastructure for the information society enabling advanced services by interconnecting things based on existing and evolving interoperable Information and Communication Technology (ICT). The term was first coined by Kevin Ashton in 1999 at the MIT Auto-ID Labs [10]. As the name suggests, Internet of Things is a combination of two words: Internet and Things [11]. The Internet is a network of networks interconnecting millions of computers globally using a standard communication protocol, TCP/IP. A Thing is any physical or virtual thing that can be identified, distinguished, and given an address, as in [11]. Examples of Things include humans, cars, food, different machines, and electronic devices which can be sensed and connected [11]. Combined, Internet of Things refers to a technology that seamlessly interconnects these "Things" using existing and evolving communication technologies and standards, anywhere and anytime, and that is capable of exchanging information, data, and resources between them. The Internet of Things aims at making these things smarter, in a way that lets them obtain information with little or no human intervention. It thereby allows communication Human-to-Human (H2H), Human-to-Things (H2T), and Things-to-Things (T2T), providing a unique identity to each and every object, as described in [11]. In the subsequent subsections, the technologies used in IoT, common use cases, the challenges IoT is currently facing, and their solutions are discussed.

2.1 Technologies in IoT

Different kinds of technologies are used in IoT applications. Basically, they can be categorized as hardware, middleware, and a presentation component [12]. The hardware components include things such as embedded sensors, while the middleware consists of application tools for analysis. The presentation component is about how the analyzed data is presented to the end user, i.e., visualization on different platforms. Below are some of the main technologies behind IoT implementations.

2.1.1 Radio Frequency Identification (RFID)

RFID is a wireless microchip technology that enables "Things" to be uniquely identified. It was first developed at the Auto-ID lab at MIT in 1999 [10]. It is an easy, reliable, efficient, and secure technology, and it is cheap compared to other devices. It consists of a reader and one or more tags that can be active, passive, or semi-passive, based on their computational power and sensing capability [12]. Passive RFID tags do not use a battery, while active ones use their own battery. RFID has various uses such as personal identification, distribution management, tracking, patient monitoring, vehicle management, and so on.

2.1.2 Wireless Sensor Network (WSN)

WSN is also one of the main technologies used in IoT; it can communicate information remotely in different ways. It has smart sensors with microcontrollers that enable it to gather, process, analyze, and distribute measurements such as temperature fluctuations, sound, pressure, and heart beat rates instantly in real time [11].

2.1.3 TCP/IP (IPv4, IPv6)

TCP/IP is a protocol suite that identifies computers on a network. There are two versions of the IP protocol, namely IPv4 and IPv6. IPv4 is currently the most widely used, but most of its address space has been depleted. For an IoT that interconnects anything, IPv4 is not a good choice because of its small address space. The newer version, IPv6, is a good solution for a future where everything is connected, because it has a very large address space that can provide an address to, and uniquely identify, almost anything [11]. Even though it is not yet widely used, it is the future for IoT when thinking about connecting almost anything.
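To give a concrete sense of the scale difference: IPv4's 32-bit addresses allow 2^32 (about 4.3 billion) addresses, while IPv6's 128-bit addresses allow 2^128 (about 3.4 x 10^38). The short Java snippet below, added here purely as an illustration, computes both counts:

    import java.math.BigInteger;

    public class AddressSpace {
        public static void main(String[] args) {
            // IPv4: 32-bit addresses -> 2^32, about 4.3 billion
            BigInteger ipv4 = BigInteger.valueOf(2).pow(32);
            // IPv6: 128-bit addresses -> 2^128, about 3.4 * 10^38
            BigInteger ipv6 = BigInteger.valueOf(2).pow(128);
            System.out.println("IPv4 addresses: " + ipv4); // 4294967296
            System.out.println("IPv6 addresses: " + ipv6);
        }
    }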

2.1.4 Visualization Component

This is also an important component of IoT, because without good visualization, interaction of the user with the environment is not achievable [12]. It should be noted that when designing any kind of visualization for IoT, the way the analyzed data is presented matters for making better decisions. That means products with easy-to-understand, user-friendly interfaces need to be designed, using already existing technologies such as touch screens in smart phones, tablets, and other devices, according to the needs of the end user.

2.2 Application Areas

IoT is the future of technology, where all things are interconnected to exchange data and provide information for the betterment of society. There are a lot of application areas that have already broken into the IoT market, and some that are not widely deployed yet. Examples of common IoT application areas include transportation, which has many domains such as traffic management, parking for vehicles, highway and road construction, and smart vehicles for the public. IoT can also make infrastructure available with a reduction in costs and resources, by providing smart metering for utilities in water and light distribution and smart grid systems. All these applications and many others show that IoT is being applied in all kinds of areas for better services, and it promises to be used even more widely in the future. In the next subsections, selected IoT use cases are discussed briefly.

2.2.1 Smart Home

Smart Home is a technology that enables almost all home appliances used on a daily basis to be connected to the internet or to each other [13]. This helps to provide better services and act according to the preferences of the owner. The home appliances may include Heating, Ventilation, and Air Conditioning (HVAC) systems, microwave ovens, lighting systems, refrigerators, garages, smart TVs, and so on. Examples include controlling the temperature of the house, controlling the lighting systems in the rooms, and checking whether the oven is on or off. These things can be deployed in a smart home environment and can also be monitored by voice control from a smart phone (Siri and HomeKit from Apple, for example) [14].

2.2.2 Wearable

This area of IoT is also getting popular nowadays, as more and more wearable devices are being manufactured. A wearable is a small mobile electronic device that comes with wireless sensor communication capability to process and gather information [15]. Wearable devices can work by themselves or by being connected to a smart phone via Bluetooth. Examples include smart watches, wrist band sensors, and rings, to mention a few. For example, smart watches connected via Bluetooth provide a variety of uses for individuals, such as email notifications and alerts for messages and incoming calls. The other kind of wearable being used widely is the wrist band sensor, which can be applied to interactive exercise and activity tracking (heart beats, pulse rates, etc.) [16]. Examples include the Apple smart watch, the Samsung Gear smart watch, and Google Glass.

2.2.3 Smart City

Smart City is a technology that delivers smart urban services to the general public, maintaining a safer environment and minimizing cost. It aims at using the available resources wisely and effectively to provide better services while reducing operational costs [17]. The different areas where IoT can be deployed in the city include e-governance, traffic management, parking services, street and road lighting, and many more [17]. It can

also be used to reduce the pollution arising from traffic congestion in bigger cities, hence playing a vital role in the sustainability of the city.

2.2.4 IoT in Agriculture - Smart Farming and Animals

This IoT application area is promising, especially in countries whose economies mainly depend on agricultural production. It is a technology where traditional agricultural equipment such as tractors carries smart sensors that measure the temperature and humidity of the soil and the water distribution. It also includes animals on agricultural farms, which are identified using RFIDs [18][19]. It enables animals to be traced and detected in real time when an outbreak of contagious disease occurs. This technology can also be used for preventive maintenance of the equipment well in advance. It revolutionizes how traditional farming is done and takes it to the next level of using the data generated from the embedded smart sensors to obtain better production and make better decisions, such as what seeds to plant, the expected crop yields, and water utilization levels. It also enables farmers to deliver their products directly to consumers [19].

2.2.5 IoT in Health/Connected Health

This is one of the most widely used IoT use cases. It is a technology that enables hospitals and patients to be connected remotely. Connected health technology keeps patients connected 24/7, which enables monitoring their health conditions and sending data to the hospital, which in turn helps doctors flexibly control and monitor their patients' well-being. This can be achieved by using smart phones and wearables, in the form of implantables that work remotely in patients' bodies or palms, so that these devices transfer the generated data to the doctor's end for further processing, notifying of emergency conditions and tracing symptoms of health threats well in advance [5]. This is vital for both hospitals and patients. For the former, the ratio of doctors to patients is not equally distributed, so this technology enables doctors to follow more patients from wherever they are, which was not possible without it. The other benefit is that, since the data is gathered by the devices, errors are less likely than with human data entry, and the data is readily available to the doctors, speeding up decision making. On the patients' side, it is good for emergency cases and it enables preventive care, especially for elderly people [5].

2.3 Challenges and Solutions

As there are a lot of emerging applications and evolving technologies in the IoT field, the challenges have also increased along with these growing trends in applications and technologies. In the following subsections, major challenges and solutions are discussed.

2.3.1 Challenges

There are a lot of challenges that the IoT field is currently facing. Bandwidth and battery problems in small devices, power disruptions related to the devices, and configuration issues are some of them [12][20]. Apart from these, it can be generalized that the major challenges facing IoT are: data security, data control and access, lack of uniform standards/structures, and the large volume of data produced.

1. Data Security: Data security in terms of IoT is defined as the necessity to ensure the availability and continuity of a given application and to avoid potential operational failures and interruptions of internet-connected devices. The threats can come at different levels, such as the device level, the network level, or the system/application level. They also come in a variety of forms, such as arbitrary attacks like Distributed Denial of Service (DDoS) and malicious software [20]. Devices such as sensors, RFID tags, and cameras, or network services (WSN, Bluetooth), can be vulnerable to such attacks and in turn be used as botnets [21]. Home appliances such as refrigerators and TVs can also be used as botnets to attack these and similar devices.

2. Data Control & Access/Privacy: It is known that IoT applications produce large volumes of data at a fast rate from different devices, and that these smart devices collect and process personal information [22]. But knowing what risks these devices carry, how the data is produced and used, who owns and controls it, and who has access to it are privacy questions one needs to ask when using the services of these devices. It is obvious that the data produced by these devices raises privacy concerns among users. The concerns mostly come in two forms [20]: first, personal information about the individual is collected and identified, and the owner does not know who accesses it or to whom it is disclosed; second, the individual's physical location can be traced and his/her whereabouts known, hence violating privacy. This shows that privacy is one of the basic challenges in the IoT field, as it is anywhere in the IT field.

3. No Uniform Standards/Structures: IoT comprises different components, such as hardware devices, sensors, and applications. These different components are manufactured and developed by different industries. When these components are designed to be used in IoT solutions, they need to exchange data. Problems arise when they try to communicate, because the standard used in one product is not used in another, creating communication or data exchange problems which may hinder the expansion of IoT products. The problem is not only in the design of devices, but also in the internet protocols used today. The currently working standard protocols for the internet are not compatible with IoT implementations [20], so

sometimes ad-hoc protocols from different vendors are used, for example in wireless communications. The absence of uniform standards/structures for the different technologies used in IoT is one challenge for the field.

4. Large Volumes of Data Produced: This is another challenge in IoT: the data produced by various sensors and mobile devices is heterogeneous, continuous, very large, and fast. This data needs to be processed instantly, before it expires. Managing these kinds of data is beyond the capacity of traditional databases. As the number of connected devices is expected to increase in the future, the data produced by these devices is going to increase exponentially, and good analytics platforms and storage systems are needed.

2.3.2 Solutions

As the challenges of IoT are large, solutions that address them should be developed and put to work to provide better services that are trusted by all parties, such as users and companies. Some of the solutions include using standard encryption technologies that comply with IoT. Since the devices are mobile, the encryption technologies to be used must be fast and consume little energy, because energy consumption is another problem of IoT devices. Using authentication and authorization schemes for controlling access levels to view the data is another solution that should be considered when designing IoT applications.

Some of the solutions to the problems discussed include:

1. Having Uniform Shared Standards/Structures: This is helpful in that having standard protocols or structures makes vendors follow the same structure, so that no problem is created when there is a need to integrate different parts developed by different organizations. For example, if hardware and sensor device designers, network service providers, and application developers all follow some standard for IoT, it will greatly reduce the integration and compatibility problems that would otherwise arise [20].

2. Making a Strong Privacy Policy for IoT: A strong privacy policy for IoT, covering how individual data is collected and used in a way that is transparent to the user, increases the user's trust in the service and makes him/her aware of how the data is used and how to control it. This means the user should be put at the center of deciding what personal information goes where and how it is used [23].

3. Using Anonymization: Anonymization is a method of modifying personal data so that nothing is known about the individual. It does not only include de-identification by removing certain attributes; the data also has to be unlinkable, because a large volume of data is being produced all the time [24]. Methods such as k-anonymity can be used.

4. Robust Storage Systems: As the data produced by IoT devices is large in volume, fast and powerful storage mechanisms are needed, such as fault-

tolerant NoSQL databases, which can handle very large data, even more than is currently needed.

Chapter 3

Overview of Tools

In this chapter, three tools that are mainly used in the analysis of streaming data are studied. The tools chosen are Apache NiFi, Apache Spark Streaming, and Apache Storm. Their general overview and features are reviewed, which serves as a basis for the study of their similarities and differences in the next chapter.

3.1 Apache NiFi History and Overview

Apache NiFi, originally named "Niagara Files", was first developed by the National Security Agency (NSA) in the United States in 2006, and was used there for 8 years. It was first developed to automate data flow between systems [25]. In November 2014 it was donated to the Apache Software Foundation (ASF) through the NSA's Technology Transfer Program. In July 2015 it became a Top Level Project of the ASF, and six releases of NiFi exist at the time this paper is written (0.6.0). Data flow is an automated and managed flow of data between systems. This means that there is a flow of information from one system to another, where one can be considered a producer and the other a consumer. These flows of information need to be guaranteed, to make sure they reach the intended parties at the time needed. But it is clear nowadays that data flow between systems faces a lot of challenges and problems. It is much more of a challenge today than in earlier times, because back then organizations did not have very large numbers of systems exchanging information; they only had one or two systems that were not too big a problem or too complex to integrate and exchange data between. Currently, however, data flow management systems face many challenges in handling different sets of data.

The major problems of data flow are:

• Integration problem: This is a problem because the different systems existing in organizations have different architectures, and even newly built systems may not consider the architectures of the existing ones. Integrating the different systems existing in the organization is beneficial to both the organization and the users. For the company, integrated systems mean information can easily flow between the different systems, which in turn is good for better decision making.

For the users, they are able to get what they request in a fast and easy way, without knowing exactly where each module or function is found; an integrated system with good data flow provides this to the users efficiently and effectively.

• Priorities of organizations change over time: This means that what was considered of little value at one time may be considered valuable next, and this needs to be taken into account when making decisions. In these kinds of conditions, the data flow system must be robust and fast enough to handle the new changes that occur and adapt to the existing ones without affecting other flows.

• Compliance and Security: This is a problem for data flow management systems because whenever organizational policies or business decisions change, there is a possibility that data security will be mistreated when trying to adhere to the new rules or decisions. Systems must always be kept secure for users, whether or not there is a change in organizational policies or business decisions, which again enhances data flow management.

NiFi supports running environments ranging from a laptop to many enterprise servers, depending on the size and nature of the data flow involved. It also requires large or at least sufficient disk space, as it has several repositories (content, flow file, provenance) whose contents are stored on disk. It can run on any machine with a major operating system (Windows, Linux, Unix, Mac OS), and its web interface renders on the latest major browsers such as Internet Explorer, Firefox, and Google Chrome.

3.1.1 NiFi Architecture

NiFi supports both standalone and cluster mode processing. Their features are discussed below.

Standalone Architecture

NiFi requires Java: the JVM hosts it, and the amount of memory it uses depends on the JVM. It has a web server inside the JVM that displays its components in a user-friendly UI. The flow file, content, and provenance repositories are all stored in local storage.

The different parts of the architecture, as shown in Figure 3.1 from [8], are:

• Flow Controller: the main part of the NiFi architecture; it controls thread allocation for the different components.

• Processor: the main building block of NiFi; it is controlled by the flow controller.

• Extensions: operate within the JVM and hold the different extension points in NiFi.

• Flow File Repository: where NiFi keeps track of the state of the active flow files. It uses write-ahead logging and lives on a specified disk partition.

• Content Repository: holds the actual content of a given flow file; the contents are stored in the file system.

• Provenance Repository: holds information about the data: what happened to it, and how and where it moved over some period of time, beginning from its origin. All this information is indexed, which makes searching easy.

Figure 3.1: NiFi standalone Architecture - source [8]

Cluster Architecture

NiFi can also be used in a cluster, where the NiFi Cluster Manager (NCM) is the master and the other NiFi instances connected to it are the Nodes (slaves). In this model, it is the Nodes that do the actual processing of the data, while the NCM manages and monitors the changes.

A NiFi cluster uses a site-to-site protocol which enables it to communicate with other NiFi instances, other clusters, or other systems such as Apache Spark.

Figure 3.2: NiFi Cluster Architecture - source [8]

Figure 3.2 shows that the Nodes communicate only with the NCM and not with each other. The communication between the Nodes and the NCM can be by unicast or multicast. When one Node fails, the other Nodes do not automatically pick up its load; rather, the NCM calculates the load balance and distributes the load to another Node. The other functions of the NCM are: communicating data flow changes to all the Nodes, and receiving health information (whether they are working properly) and status information from the Nodes. The Nodes are regularly checked for load balancing by the master, so that they are given flow files to process according to their load. As many Node instances as needed can be added horizontally to the cluster, as long as the NCM is working and operating.

3.1.2 Key Features

Apache NiFi has a lot of useful features that provide better flow management mechanisms compared with other systems. It can be said that it was designed by learning from the drawbacks other systems have. These features can also be considered advantages it has over other systems.

The points below are some of the main features of NiFi [8][26].

• Flow-specific Quality of Service (QoS): This comprises guaranteed delivery vs. loss tolerance, and latency vs. throughput. The QoS achieved for a flow relates to how the flow is configured: to give high throughput with low latency, or to be loss tolerant. NiFi can be configured per flow to be loss tolerant or, where data loss is unacceptable, to guarantee delivery. Guaranteed delivery is achieved by using both the content repository and persistent write-ahead logging (WAL): NiFi keeps track of changes made to a flow file's attributes and to the connection the flow file belongs to [27], writes these changes to the log before they are written to the actual disk, and finally writes the contents to the disk. This is important for recovery and prevents data loss. Latency is the time required to process a flow file from beginning to end, and throughput is the amount of flow file content processed in a given time, i.e., how many flow files are micro-batched in a specified time. Every time a processor finishes processing a specific flow file, the repository must be updated before the flow file is sent to the next component, which is expensive and takes time. Since this process is expensive, it is better to do more work at once, i.e., to micro-batch more flow files for processing in a given time. The drawback is that the next component or processor cannot start until the repository is updated and these flow files are processed, hence producing latency. NiFi lets the user trade latency against throughput when configuring a processor in its settings tab, so that a suitable point can be chosen to get the best result according to the need.

• Friendly User Interface (command and control): NiFi provides a friendly User Interface (UI) running in a browser, designed using HTML5, drag-and-drop mechanisms, and JavaScript technologies. The UI is useful especially when flows become complex and managing them from a console would be very tough. NiFi provides an easy command-and-control mechanism that enables making changes to a specific flow file or processor while controlling only the affected parts; the effect is seen in real time, and other flow files or processors are not affected at all.

• Security: One concerning issue in other flow management systems is security. NiFi provides security in two forms: system-to-system and user-to-system security mechanisms. For the first, it enables encryption and decryption of each of the flows involved, and when communicating with other NiFi instances or other systems it can use encryption protocols like 2-way SSL. For the second, i.e., user to system, it provides 2-way SSL authentication and also controls users' access levels through privilege levels such as Read Only, Data Flow Manager (DFM), Provenance, and Admin.

• Dynamic Prioritization: NiFi has a queuing mechanism that enables it to retrieve and process flow files according to specified queue prioritization schemes. Prioritization can be based on size or time, and NiFi even allows making custom prioritization

schemes. The need to prioritize queues arises because of constraints in bandwidth or other resources, or because of how critical an event is. This is helpful for setting the priorities according to the required properties or needs at hand, because the priorities set at one time may not be good enough at other times, and they will affect the decision if not set properly; hence NiFi allows dynamic priority setting for different scenarios according to the need.

• Data Provenance: Data provenance is one of the most important features of NiFi; it enables managing and controlling the flow of data from beginning to end by automatically recording each performed action. From the data provenance page, the user/DFM can see what happened to the data: where it came from, where it went, what was done with it, and so on. This is useful when problems occur, because it increases traceability and helps track down the issue. It also enables seeing the lineage or flow hierarchy of the data.

• Extensibility: Another feature NiFi provides is the extensibility of its various components, such as Processors, Reporting Tasks, Controller Services, and Prioritizers [8]. This is useful because it enables users or organizations to design their own extension points/components and embed them in NiFi to gain better service in their own specializations. The most widely extended component is the processor: many organizations design their own processors to ingest data into, or egress data from, NiFi. For example, in IoT applications data is produced by different devices in different formats, and these different data formats need to be processed to gain insight from them; NiFi's extensibility can be used to design processors that ingest these formats into NiFi, where its built-in processors then process the ingested data according to the need. This makes extensibility one of the key features of NiFi.

3.1.3 NiFi UI components

NiFi provides visual command and control for creating, managing, and monitoring data flows. After the user starts the application, entering the URL https://<hostname>:8080/nifi in a web browser brings up a blank NiFi canvas the first time. The <hostname> is the name of the server or the address that the NiFi instance is running on, and 8080 is the default port number for NiFi. The points below show the different components of the UI, as in Figure 3.3.

• Default URL address: As shown in Figure 3.3, since the machine is running locally, the hostname is "localhost", with the default port number 8080, which can be changed in the "nifi.properties" file in the NiFi directory (see the snippet after this list).

• System Toolbar: NiFi has 4 system toolbars, namely the Component, Action, Search, and Management toolbars, as shown in Figure 3.3.

– Component: consists of the different components such as Processors, Input and Output Ports, Process Groups, Remote Process Groups, Funnel, Templates, and Label.

Figure 3.3: NiFi UI canvas

– Action: consists of buttons to perform actions on a particular component. Some of the actions are Enable, Disable, and Start if the process is not started or is stopped; Stop if the process is started; Copy to copy the particular component; Group to group different components together; and so on.

– Search: consists of the search field to search components existing on the canvas.

– Management: consists of buttons used by different users (DFMs, Admin) according to their privilege levels. It includes bulletin boards, the Summary page, Provenance, and so on.

• Status Bar: In Figure 3.3 above, the status bar includes the Status and Component Info areas labeled in the figure. The Status shows the active threads, if threads are being used; it also shows the total number of queued flow files between the different components, the existing clusters and how many nodes are connected, and a timestamp of the last refresh. The Component Info shows how many processors or other components are running, stopped, invalid, disabled, and so on.

• Navigation Pane and Bird's Eye View: The navigation pane enables navigating, zooming in, and zooming out of the components on the canvas, and the Bird's Eye View allows the user to view the data flow easily and quickly.
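For illustration, the web server entries in the "nifi.properties" file referred to above look roughly like this (a sketch of typical NiFi 0.x defaults; exact entries can vary between versions):

    # conf/nifi.properties - web server host and port
    nifi.web.http.host=
    nifi.web.http.port=8080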

3.1.4 NiFi Elements

NiFi has different elements, some of which are discussed further in the subsections that follow; Figure 3.4 shows the main components it supports.

1. User Management: NiFi provides a mechanism for user management and for controlling privileged access. It supports user authentication either by client certificates or by a username/password mechanism. Authenticated users use HTTPS for accessing data flows in a browser. In order to use the username/password mechanism,

17 Figure 3.4: NiFi main components

a login identity provider must be configured in the "nifi.properties" file: one property points to the provider configuration file, and the other indicates which provider should be used, i.e.:

• nifi.login.identity.provider.configuration.file

• nifi.security.user.login.identity.provider

Likewise, for controlling access levels, NiFi provides a pluggable authorization mechanism that enables users to have access to the system and to be assigned different roles. For this, the "nifi.properties" file is configured with these two properties:

• nifi.authority.provider.configuration.file - specifies the configuration file for authorization providers

• nifi.security.user.authority.provider - which provider to use from the configured ones
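As an illustration, the four properties above might be set as follows (a sketch using typical NiFi 0.x defaults; the provider identifiers are examples referring to entries in the provider XML files, not prescribed values):

    # conf/nifi.properties - authentication and authorization providers
    nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml
    nifi.security.user.login.identity.provider=ldap-provider
    nifi.authority.provider.configuration.file=./conf/authority-providers.xml
    nifi.security.user.authority.provider=file-provider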

NiFi also provides Roles for controlling authorization; some of the Roles it provides are listed below. Users can have different Roles assigned to them.

• Administrator: configures user accounts and the size of thread pools.

• Data Flow Manager (DFM): manipulates the data flow, e.g., designing, ingesting, routing.

• Read Only: may only view the data flow; not allowed to make changes.

• Provenance: able to query the provenance repository and view lineage, and to view and download the content of flow files; not able to replay flow files in case of failure or during troubleshooting.

2. Processor: The Processor is the main building block of a NiFi data flow. It is responsible for ingesting data from other systems or NiFi instances, routing, transforming, and

finally outputting the data to other systems. It is also the main extension point: custom processors can be designed to enable organizations to input/output their flow files using NiFi. Figure 3.5 below describes its anatomy.

Figure 3.5: NiFi Processor Anatomy

• Processor Type and Name: As the name implies, the Processor Type specifies the type of processor used; in this example it is a "PutFile" processor, which is responsible for writing flow files to disk. The name of the processor is shown in bold; by default it takes its type name as its name, but it can be renamed in the settings tab of the processor's configuration page. In this example the name is "Save Matched Tweets", a PutFile-type processor that stores matched tweets to disk.

• Status Indicator: This is the icon at the top left corner of the processor that shows the current status of the processor. There are different status indicators, based on the validity of the processor:

– Running: shows that the processor is running. It has a green play icon.
– Stopped: shows that the processor is currently stopped. It has a red icon.
– Invalid: shows that the processor cannot be started because there are missing properties that need to be set. The missing properties can be seen by hovering over the icon. Its icon is a triangle with an exclamation mark inside it.
– Disabled: shows that the processor is disabled and cannot be started until it is enabled.

• Flow Statistics: shows the statistics of the data flow over the past 5 minutes, in the fields In, Read/Write, Out, and Tasks/Time. These show, respectively: the number of flow files and total size ingested into the processor; the total size of flow file content read from and written to disk; the number of flow files and total size of flow file content transferred to the next processor/component; and the number of tasks this processor performed and the time it took to perform them, over the past 5 minutes.

3. Input/Output Ports: The Input Port is one of the components of NiFi; it is used for transferring data coming from other components or systems into a Process Group. The Output Port is used for transferring data from a Process Group to destinations outside of the Process Group, or to other components/systems such as Apache Spark.

4. Process Group and Remote Process Group: A Process Group is another NiFi component that logically groups a set of components, which makes maintenance easier. It prompts the user for a unique name and provides a level of abstraction. A Remote Process Group (RPG), on the other hand, follows the same idea as a Process Group, but connects to another NiFi instance remotely. It asks for the URL of the remote instance rather than a unique name, so that a connection is created between the RPG and that NiFi instance. It uses the site-to-site communication protocol to communicate with remote instances or other systems.

5. Template: A Template is another component of NiFi that enables re-use of the components created inside it. It lets users create Templates and export them in XML format; a Template can then be imported into other NiFi instances for use. It is thus the feature that makes NiFi data flows reusable.

6. Funnel: A Funnel is a component used for combining different components or processors into one, which makes prioritizing easier. If a data flow has many processors, setting priorities at each processor hurts performance; NiFi instead provides the possibility to set priorities, and to change them dynamically, at a single point, i.e., in the Funnel.

7. Provenance and Lineage: Data provenance is one of the key features as well as elements of NiFi; it keeps very detailed records on each piece of data it ingests. Its provenance repository stores everything that happens to the data from beginning to end, such as ingesting, routing, transforming, cloning, etc. This means that everything that passes through NiFi is recorded and indexed, which makes it easier to search, to track down problems that occur and provide solutions, and also to monitor the overall data for compliance. There is a provenance icon in the Management toolbar at the top right corner of the NiFi UI, and it displays everything that has happened in the data flow. It enables searching and filtering by Component Name, UUID, and Component Type. When the "View Details" icon is clicked, the details of that particular event are displayed in 3 tabs, as in Figure 3.6: the Details tab lists the time, type of event, UUID, and so on; the Attributes tab lists all the attributes that existed at the time the event occurred, with their previous values; and the Content tab enables downloading or viewing the content. NiFi also provides the possibility to see the provenance data for each processor by right-clicking on the processor and choosing Data Provenance. As an example, a Twitter data analysis flow is used that searches for all tweets containing the phrase "InternetofThings" and loads the language, location, text, username, and so on, according to the properties set. Figure 3.6 below shows the provenance data for it.

Figure 3.6: NiFi Provenance

On the right side of the provenance page there is an icon for showing lineage, "Show Lineage", which shows a detailed graphical representation of what happened to the data. It enables seeing the details and parents of a particular event, and expanding it as needed. It has a slider that enables seeing which event was created at what time, and how long it took to create, by dragging the slider. It also enables downloading the lineage graph, as shown in Figure 3.7.

Figure 3.7: NiFi Lineage

3.2 Apache Spark Streaming

Apache Spark is an open source, fast and general engine for large-scale data processing [28][29]. It was originally developed at the AMPLab at UC Berkeley, California [29], and is currently a top-level Apache project. Spark's core abstraction is called the Resilient Distributed Dataset (RDD), an immutable collection of elements. Apache Spark is a main API with different components: Spark SQL, MLlib, GraphX, and Spark Streaming.

• Spark SQL: one of the modules in the Spark core API; it enables the user to work with traditional structured data [30].

• GraphX: another Spark API, for graphs and graph-related operations [31].

• MLlib: a Spark API for machine learning, consisting of various kinds of machine learning algorithms [32].

• Spark Streaming: a Spark API mainly dealing with computation and analysis of live streams of data flowing in at specified time intervals [33].

Apache Spark Streaming is one of the components of the Spark core API; it processes streams of data as micro-batches. It is also possible to use other components from the Spark API, such as MLlib and Spark SQL, together with it for further processing.

3.2.1 Key Features

As one of the components in the Spark API, Spark Streaming shares the main features that Spark provides and adds others on top of them. Some of the main features are listed below.

• Spark Streaming provides a high-level abstraction called Discretized Streams (DStreams), which are built on Resilient Distributed Datasets (RDDs), Spark's main abstraction.

• It makes integration of streaming data with batch processing easy, because it is part of the Spark API.

• It receives data from different sources such as HDFS, Flume, and Kafka; it also enables custom-made receivers.

• It supports different programming languages such as Java, Scala, and Python.

• Fault tolerance: it has "exactly-once" semantics, which make sure that data is not lost and arrives exactly once, avoiding duplicates; this is also advantageous for data consistency.

• It provides stateful transformations that maintain state even if one of the nodes fails, which is good for fault tolerance.

• Speed: it performs in-memory computations, which have low latency and provide faster processing than computations performed on disk.

3.2.2 Basic Concepts and Main Operations

Basic Concepts

The main programming model of Spark Streaming is its abstraction, Discretized Streams (DStreams). A DStream is a continuous stream of data, internally represented by Resilient Distributed Datasets (RDDs). A DStream can be created by ingesting data streams from different sources such as Kafka, Flume, or Twitter, or by applying transformations to other DStreams. The RDD is Spark's main abstraction: a fault-tolerant collection of elements that can be processed in parallel [34].

Figure 3.8: Continuous RDDs form a DStream - source [33]

Figure 3.8 shows a DStream as a continuous stream of batches of RDDs at a specified time interval; when all these batches of RDDs are combined, they form a DStream. DStreams support various transformations, similar to those on RDDs in the Spark API. These transformations allow the data from input DStreams to be modified. Examples of such transformation functions include map, filter, and reduce.
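To make these concepts concrete, the following minimal word-count sketch uses the Spark 1.6 Java API (the version studied in this thesis). It builds a DStream from a socket source and applies the flatMap, mapToPair, and reduce-style transformations mentioned above; the host, port, and batch interval are arbitrary placeholder choices:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
            // Every 10-second batch interval becomes one RDD in the DStream.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
            });
            JavaPairDStream<String, Integer> counts = words
                .mapToPair(new PairFunction<String, String, Integer>() {
                    public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    public Integer call(Integer a, Integer b) { return a + b; }
                });

            counts.print();  // an output operation is required to trigger execution
            jssc.start();
            jssc.awaitTermination();
        }
    }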

Main Operations

Spark Streaming also provides various kinds of operations on DStreams. The main ones are the transform, window, join, and output operations [33].

• Transform and Join: The transform operation allows RDD-to-RDD operations to be applied to a DStream, such as joining the data stream with other datasets. Spark Streaming enables different DStreams to be joined with other DStreams: there are stream-stream joins, which join streams of RDDs with streams of other RDDs, and stream-dataset joins, which join streams with datasets via the transform operation [33].

• Window Operations: Since the live streams of data coming from various sources are continuous, they cannot be computed as a batch of files, and traditional operations cannot be performed on them. Spark Streaming provides a solution with window operations, which enable these streams of data to be processed, transformed, and computed within a specified time range over a sliding window. Every window operation must specify a window length and a sliding interval to perform its actions over a window [34]. The window length is the duration of the total window, while the sliding interval is the rate or interval at which the operation is performed. Spark Streaming supports many window operations, such as window, countByWindow, and reduceByKeyAndWindow (a sketch follows this list).

• Output Operations: Spark Streaming supports many output operations, which make sure the processed streams are stored in external storage such as HDFS, file systems, or databases, or even displayed on live dashboards. print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD are some of the output operations Spark Streaming provides (see the sketch after this list).
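Continuing the word-count sketch above (again an illustration, with arbitrary durations and paths), the snippet below applies reduceByKeyAndWindow over a 30-second window sliding every 10 seconds, followed by two output operations. Both the window length and the sliding interval must be multiples of the batch interval:

    // A 30-second window sliding every 10 seconds over the `counts` pair DStream.
    JavaPairDStream<String, Integer> windowed = counts.reduceByKeyAndWindow(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        },
        Durations.seconds(30),    // window length
        Durations.seconds(10));   // sliding interval

    // Output operations: print to the console and persist each batch as text files.
    windowed.print();
    windowed.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt"); // path prefix and suffix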

3.2.3 Architecture

How Spark Streaming operates can be summarized as:

• Receiving the input - the sources could be Kafka, Twitter, log data, etc., which Spark Streaming divides into small batches

• Spark Engine - processes the data received from Spark's memory

• Output - batches of processed data are sent to storage systems

The tasks are assigned dynamically to the nodes based on the available resources, which enables fast recovery from failures and better load balancing between the nodes. Its ability to divide the input streams into small batches enables it to process the data in batches and reduces the latency compared to processing the records one by one.

Figure 3.9: Spark Cluster - source [33]

In addition to this, Spark Streaming runs on a cluster as in Figure 3.9. The main program in a Spark cluster (also known as the Driver program) has a Spark Context that coordinates the Spark application running on the cluster. The first step is creating a connection to an available Cluster Manager, which allocates resources to individual applications. Once the connection is created, Spark acquires Executors on Worker Nodes, which in turn run application code in the cluster. Spark then sends the code to the Executors, which are able to run tasks and keep data in memory or on disk storage. Finally, the Spark Context sends the tasks to be run. The available cluster managers include Hadoop YARN and Apache Mesos, and Spark can also run in Standalone Mode.
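As a small illustration, the cluster manager is chosen through the driver's configuration; the master URLs below are standard examples, and the standalone host name is an assumed placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode uses spark://host:7077; on YARN the master is
// "yarn-cluster" or "yarn-client" (Spark 1.x); local[n] runs in one JVM
val conf = new SparkConf()
  .setAppName("ClusterExample")
  .setMaster("spark://master-host:7077") // assumed standalone master URL
val sc = new SparkContext(conf)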

3.3 Apache Storm

The other tool that is going to be studied in this chapter is Apache Storm. An overview of the tool, its features, its components, and its main use cases will be briefly given.

3.3.1 Overview

Apache Storm is a distributed, resilient, real-time computation system [35]. It was developed by Nathan Marz and became open source in September 2011 [36]. It works in a way similar to Hadoop, except that Apache Storm is for real-time streaming data while Hadoop is for batch processing.

3.3.2 Basic Concepts and Architecture

In this subsection, the different components and concepts of Storm are discussed and its architecture is presented.

Basic Concepts

• Tuple :- the primary data structure in Storm, which is a list of values that supports any data type [37].

• Streams :- a core abstraction in Storm by which unbounded tuples form a sequence or stream. A stream can be formed by the transformation of one stream into another. It has primitive types such as long, string, and byte arrays, and also supports custom types defined by users, provided they implement their own serializers.

• Spouts :- the main entry point of streams into Storm. Different external sources such as Kafka and the Twitter API ingest their data through Spouts. Spouts can be Reliable, where replaying a lost tuple is possible if failures occur, or Unreliable, where replaying is not possible and the data will be lost.

• Bolts :- where the main processing takes place. A Bolt takes inputs from Spouts and processes them, and finally the processed tuples are emitted to downstream Bolts or to storage such as databases. The processing includes stream transformation, running functions, aggregating, filtering, joining data, or sending it to databases.

• Topology :- the main abstraction of Storm. It is a network of Spouts and Bolts which are connected with stream groupings. Each node of the graph/network represents either a Spout or a Bolt, and the edges represent which Bolts are subscribed to which component, i.e. Spout or Bolt. In Figure 3.10, the nodes are spouts (S1, S2) and bolts (B1, B2, B3, B4). B1, B2, and B4 are subscribed to streams coming from S1; B4 additionally is subscribed to streams coming from S2. This shows that in a Topology, the tuples are streamed only to the components that are subscribed to them.

• Trident :- an API that is part of Storm and built on top of it. It supports “exactly-once” semantics.

Figure 3.10: Storm Topology - source [38]

• Stream Grouping :- Storm has different inbuilt stream groupings and also supports custom-made stream groupings. The main stream groupings include Shuffle Grouping and Field Grouping. Shuffle Grouping randomly distributes the tuples among the tasks of the Bolt, while Field Grouping groups the tuples having the same field name [38].

• Task :- refers to a thread of execution.

• Worker :- executes a subset of all the Tasks existing in the Topology.

3.3.3 Architecture

Storm supports both local and remote modes of operation, where the local mode is mainly useful for developing and testing topologies, and in remote mode topologies are submitted for execution in a cluster [38]. There are two kinds of nodes in a Storm cluster, i.e. the Master Node and the Worker Nodes. The Storm architecture has three main components, namely Nimbus, a daemon that runs on the Master Node; the Supervisor, a daemon running on each Worker Node; and Zookeeper, which mainly handles communication between Nimbus and the Supervisors, as shown in Figure 3.11. Their functionality is summarized in Table 3.1:

Figure 3.11: Storm Cluster - source [38]

Nimbus | Supervisor | Zookeeper
Assigns tasks to worker nodes | Receives the work assigned to its worker | Handles the communication between Nimbus and the Supervisors
Monitors for failures | Starts and stops workers as required | Keeps the state of the topology
Distributes code among cluster components | |

Table 3.1: Storm Architecture Components Functionality

3.3.4 Features

The features of Storm also show its advantages and why it is popular nowadays for stream data processing. Some of the main features include:

• Reliability :- It provides guaranteed message processing by using “exactly-once” semantics from the Trident API or “at-least-once” semantics from core Storm. It also makes sure that specific messages will be replayed in case failures occur for those specific messages.

• Fast and Scalable :- Supports the parallel, horizontal addition of machines and scales fast with an increasing number of machines.

• Fault-Tolerant :- Failure in Storm occurs, for example, when a worker dies or when the node itself dies. In the first case, the supervisor handles the failure by automatically restarting the worker, while in the second case the tasks will time out and be assigned to another machine or node.

• Support for many Languages :- Storm uses a Thrift API that makes it possible to support many programming languages such as Scala, Java, and Python.

Chapter 4

Review and Comparison of the Tools

In this chapter, the tools that were studied in the previous chapter are further reviewed and then compared based on some selected parameters. The parameters are not selected based on any particular model, but rather from the characteristics of the tools. It is important to answer questions like:

• Which tool is preferable if one parameter is wanted more than another?

• What would be the complexity of using a given tool for a given case?

• How does each tool respond to the parameters specified?

The selected parameters include:

(i) Ease of use (ii) Security (iii) Reliability (iv) Queued data/data buffering (v) Extensibility

4.1 Review

4.1.1 Apache NiFi

• Ease of Use : NiFi's ease of use comes with its friendly drag-and-drop User Interface, from which the activity and the flows are controlled. If we have more complex data flows with different types, handling them from the command line is very complex and would not provide good detail. NiFi solves this issue by allowing all the flows to be designed in a UI, which reduces complexity and allows fast recovery from problems, making maintenance easy. Another feature that makes NiFi easy to use is that its flows can be changed and customized on the fly without affecting other parts of the flow. It also accepts data from a variety of sources in different formats such as FTP, HTTP, XML, JSON, CSV, and different file systems, which further contributes to its ease of use.

• Security : NiFi has inbuilt security and supports different security schemes both at the user and system level. It allows each data flow to be encrypted/decrypted through processors provided for this purpose. It provides both certificate and username/password authentication mechanisms. It does this through 2-way SSL authentication, where a specific user is allowed access if the certificate it uses is legitimate, by exchanging acknowledgments between the client/browser and the server. It also has an access-level authorization scheme where users are assigned different Roles. This is important for use cases where security is critical, such as the financial, governmental, and similar sectors.

• Reliability : The reliability of a system is its ability to function properly for its intended purpose without failure. It includes the ability to provide guaranteed delivery of the processes at hand. NiFi is a reliable system and provides this feature by using the Content Repository and the Write-Ahead Log (WAL) mechanism, where the content of the data is stored first in the log files before it is written to disk. Hence, if a problem occurs, it is possible to recover the data from the log files without affecting the flow.

• Queued Data/Buffering : A data buffer is a memory area where data is temporarily stored. Queuing of data occurs because the data is not processed at a given time, or because a node failed. This queued data has to be put in some memory as a data buffer. But it takes memory space if the queued data is always kept, so there has to be an efficient way to handle such cases without exhausting resources. In this regard, NiFi buffers queued data efficiently, keeping the queued data in memory. It has a back pressure mechanism where a certain limit for processing data is specified; if that limit is reached, more data will not be processed until the queued data is processed and memory space is released. By providing these features, NiFi handles queued data in an efficient way.

• Extensibility : NiFi's extensibility serves various uses. It has many extension points that users can design against according to their needs, such as processors, reporting tasks, and controller services, to mention a few. Flow files can be changed in real-time without affecting other flow files. There is no need to recompile the whole flow: if a new flow file is created or an old one is removed, its effect is seen in the UI in real-time without compilation.

4.1.2 Spark Streaming

• Ease of Use : Spark Streaming's ease of use comes from its core Spark API, which has APIs for different programming languages. It has support for Scala, Java, and Python. This is useful for users who are familiar with the languages mentioned, and shows that it is flexible and addresses more users as the number of supported languages increases. It also has an interactive shell and supports different APIs.

• Security : Spark supports authentication through Kerberos security and using a Shared Secret [39]. Using Kerberos authentication requires creating a Principal and a keytab file and configuring the Spark history server to use Kerberos. Only Spark

running on a YARN cluster supports Kerberos authentication; it is not available in standalone mode. The second type of authentication uses a Shared Secret, where a handshake between Spark and the other system is made to allow communication between them. In order to communicate, both must have the same shared secret key. For this authentication to work, the “spark.authenticate” parameter must be set to true.
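As a minimal sketch, enabling shared-secret authentication in application code could look as follows; the secret value here is a hypothetical placeholder.

import org.apache.spark.SparkConf

// Both communicating sides must be configured with the same secret
// for the handshake to succeed
val conf = new SparkConf()
  .setAppName("SecureApp")
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "my-shared-secret") // placeholder value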

• Reliability : Spark Streaming is a reliable, fault-tolerant computation framework where data processing is guaranteed. It uses different mechanisms to address fault tolerance and guaranteed delivery of data, such as exactly-once delivery semantics and Write-Ahead Logging (WAL). Exactly-once semantics is a form of delivery semantics where data is processed exactly one time; it does not allow duplicates to be formed. Failure may occur in two forms: node/executor failure and driver/main program failure. When a node fails, it is automatically restarted and normal operation continues, because the data blocks in the receivers are replicated. Once the data is ingested to a node, it is guaranteed that it will be processed. When the driver/main program dies, all nodes fail along with their received blocks. If DStream checkpointing is enabled, it is possible to restart the main program from the last checkpoint, after which all executors are restarted. A DStream checkpoint specifies a fault-tolerant directory, such as one in HDFS, to regularly store the state. Failure may also occur while input data is being loaded. When this happens, Spark Streaming recovers some of the data but not all of it. The solution Spark provides to recover all the data is the WAL, where the ingested data is written synchronously to fault-tolerant storage such as HDFS or S3 before being processed. If the data is received correctly, an acknowledgment is sent and the data is processed. If no acknowledgment is sent, a failure has occurred, so Spark reads the log files and the data is sent again for processing from there. All these methods make Spark a reliable and fault-tolerant processing framework.
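A hedged sketch of driver recovery via checkpointing, assuming an HDFS checkpoint directory; the path and batch interval are illustrative.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/streaming-app" // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir) // state is saved here regularly
  // ... define sources and transformations here ...
  ssc
}

// A fresh start builds a new context; after a driver failure the context
// is rebuilt from the checkpoint and processing resumes from there
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()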

• Queued Data/Buffering : In real-time data processing, a queue is created whenever the data is not processed within a specified time interval and the processing rate is slower than the rate at which the data is received. This data is queued in a buffer and keeps increasing if it is not processed or removed. In Spark Streaming too, the data will be queued as DStreams in memory and the queue will keep growing. To overcome this, Spark Streaming provides configuration parameters which help to limit the rate at which data is received and processed. It also uses other methods, such as reducing batch processing times or choosing the right batch size, so that batches can be processed at the rate they are received.
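For example, the receiving rate can be capped through configuration; the maxRate key exists in Spark 1.x, and the backpressure setting, available in later 1.x releases, adapts the rate automatically. The number is illustrative.

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "1000")      // records/sec per receiver
  .set("spark.streaming.backpressure.enabled", "true")  // adapt rate to processing speed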

• Extensibility : When new application code needs to replace old application code, Spark Streaming provides two ways of upgrading. One is shutting down the existing application gracefully and starting the new application, which begins processing from the point where the earlier application left off. The other is starting the new application in parallel with the existing one and shutting the old one down later.

4.1.3 Apache Storm

• Ease of Use : Storm's ease of use comes with its easy-to-use API and its ability to support different programming languages through its Thrift API. Thrift is an Interface Definition Language and communication framework that allows defining new data types that support different programming languages [40].

• Security : Storm, for this particular release (0.9.6), does not provide inbuilt security (authentication and authorization). It does not provide encryption of the data over the network by itself. This means that security mainly depends on technologies outside of Storm, such as firewall settings and encryption applied to different parts such as Topologies. The latest release of Storm supports Kerberos authentication by creating keytabs and principals for the daemons.

• Reliability : Storm guarantees full data processing even if any of the connected nodes in the cluster dies or messages are lost. Full data processing means that all the messages in a tuple tree are fully processed within a specified time interval; otherwise processing fails. Storm guarantees this by providing at-least-once semantics, which guarantees messages are replayed when failure occurs and allows duplicates to be formed. It also offers the Trident API for occasions where exactly-once processing of data is needed. There are different points of failure, such as node failure, worker failure, or daemon failure (Nimbus and Supervisor), and they are handled differently to provide a fault-tolerant system. When a worker dies, it is automatically restarted by the Supervisor and nothing is lost. If a node dies, Nimbus will assign the task to other machines because the tasks assigned to that machine time out. If the daemons die, worker processes are not affected and will continue when the daemons are restarted.

• Queued Data/Buffering : Storm provides techniques to prevent over-queuing, i.e. when data is queued excessively or stays in the buffer too long without being processed. If the incoming data is not processed within the specified time, the buffer begins to fill up with messages and grows too much. This causes the task-processing timeout to be reached, which in turn causes messages to be re-emitted at the spout. So Storm provides back pressure, by which a threshold on the number of messages to process is predefined along with other properties. When this threshold is reached, Storm holds back further messages until those in the queue are processed first.
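A minimal sketch of such a cap in topology configuration, using Storm's 0.9.x packages; the numbers are illustrative only.

import backtype.storm.Config

val conf = new Config
// Cap the number of tuples that may be pending (emitted but not yet
// acked) per spout task; further tuples are held back until acks arrive
conf.setMaxSpoutPending(1000)
// Tuples not fully processed within this many seconds are failed and replayed
conf.setMessageTimeoutSecs(30)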

• Extensibility : New code that is written after the last deployment needs to be recompiled in order to incorporate and use it.

4.2 Differences and Similarities

This section summarizes the selected parameters and how each tool behaves with respect to them. The intention is to show differences, similarities, and their applicability in certain use cases, as shown in Table 4.1.

Parameter | NiFi | Spark Streaming | Storm
Web UI | Has friendly UI | Interactive shell, UI for monitoring | UI for monitoring
Main element | Processors | DStreams | Spouts, Bolts, Topologies
Language | Has its own Expression Language | Java, Scala, Python | Any programming language
Reliability/Fault-tolerance | Content Repository and Write-Ahead Log | Exactly-once semantics, immutable RDDs | At-least-once; exactly-once with Trident
Security | Inbuilt security, SSL, SSH, HTTPS, content encryption | Shared secret, Kerberos authentication | No authentication (0.9.6); Kerberos support in later releases
Applicability/Use case | Good for simple processing | Good for simple + complex processing | Good for simple + complex processing

Table 4.1: Differences and similarities of the tools

Simple processing :- includes extracting, splitting, merging, filtering, routing, ETL transformations, etc., of the data.
Complex processing :- includes massive computations, aggregations, window and join operations, machine learning computations, etc.

4.3 Discussion of the parameters

In this section, a discussion of the parameters for each tool is given. The discussion is intended to guide a reader in deciding which tool to use, considering the listed parameters or a combination of them.

• Ease of use/Usability : Ease of use is one of the main features to consider in every software system. It defines how easy it is to use a system effectively and efficiently. It can come in terms of the programming languages the tool supports, for example by being accessible to a large user base in different languages. It can also be in terms of how the tool manages complexity, i.e. whether it has easy techniques such as a UI or relies on scripts and logs from the command line. That said, Spark Streaming supports different programming languages such as Scala, Java, and Python through its language-integrated APIs. It

addresses a large audience of users familiar with these programming languages. Apache Storm, on the other hand, supports any programming language through its use of the Thrift API. This is important because anyone familiar with a certain programming language can do the work without needing to learn a specific language, with only minor configuration and technical changes. When we consider Apache NiFi, flows can be designed with little or no coding, and it has its own Expression Language, which enables the use of different functions and regular expressions in different formats. It is easy, but it is a new kind of language, so if the learning curve is considered, since it differs from common programming languages, it takes time to learn and utilize it well. Generally, when we consider ease of use in terms of the languages used, even though Storm and Spark Streaming support many languages and reach a large number of users, NiFi's ease-of-use features are important to consider when deciding which tool to use. When considering ease of use in terms of managing complexity, Spark Streaming mainly uses an interactive shell for processing the data and a UI for monitoring the cluster environment, memory usage, information about executors, and so on. But it does not enable controlling and monitoring the flow from the UI. Storm also uses its UI for monitoring, not for the flow of the data but for other information such as memory usage; it basically relies on running scripts or code applications. NiFi, on the other hand, has a friendly UI that goes beyond monitoring and controlling of information and the cluster environment. It enables command and control, where the user can view the flow of files in real-time. It is possible to design the flow and see the effect right away without affecting other flow files. This feature is important when there are many flow files arranged in a complex way, which cannot be handled in an interactive shell or command-line interface. Such a capability, i.e. a user-friendly UI, makes control of the flow easy and enables seeing problems and making the right decisions in real-time, whereas using the command line with many scripts will not speed up decision making when problems occur that need to be handled right away. Considering this feature, NiFi provides easy and effective data processing while controlling the flow with a friendly UI.

• Reliability : Reliability is one of the main features to consider in real-time data processing. All three tools handle reliability in various ways. Spark Streaming uses DStreams, which are continuous RDDs, and RDDs are immutable, which makes them fault tolerant. Spark Streaming provides reliability through exactly-once delivery semantics, which guarantees that the data is delivered exactly one time with no duplication, and it also uses Write-Ahead Logging. Storm provides at-least-once delivery semantics, which guarantees the data is delivered at least one time, so duplication can exist; it provides exactly-once semantics if the Trident API is used. NiFi provides reliability by using its Content Repository and Write-Ahead Logging, where every action is written to log files before it is written to disk. So the choice of tool is a design decision based on how the data should be handled: if duplicate data is acceptable in simple processing, then Storm with its at-least-once guarantee is a good choice; if the situation is a banking transaction where cash withdrawals and the like occur, then exactly-once delivery, as in Spark or Storm with Trident, is the one to choose.

• Security : Security is also an important feature to consider in real-time processing. Spark Streaming, through its core Spark API, provides shared secret authentication, where communication happens after a handshake between the systems. It also supports Kerberos authentication, but only on a YARN cluster. Storm release 0.9.6 does not provide authentication, and security is handled by external firewalls; from release 0.10.0 onwards, it supports Kerberos authentication. NiFi comes with different alternatives for security. It provides certificate authentication and username/password authentication. It also provides pluggable authorization, offering Roles for different users, and it allows encryption/decryption of the flow files. So if the use case needs more security even within the flow files, for example in a bank environment where transactions have to be highly secured, the contents flowing may need to be encrypted for some users based on Roles and decryptable for others. In such cases NiFi is a good tool to choose.

• Extensibility : Extensibility can be described in terms of the addition of new features/components or functionality, as well as modifications to existing systems. When considering the addition of new features/components, Spark Streaming builds on core Spark's extensible API, which supports modules such as SQL and MLlib. This allows the Streaming API to use one or more of these modules, which makes it extensible. Storm is designed to be extensible for using external functions such as SQL features; for that it uses its Topologies and other APIs. NiFi is also designed to be extensible: its main components, such as Processors and Reporting Tasks, are extension points. It is possible to design your own Processors capable of achieving your purposes, and also to use NiFi's existing Processors to modify and transform your data. For use cases where data from sensors in IoT applications needs to be transformed from one form to another for processing, choosing NiFi would be a good fit considering this feature. Extensibility can also mean adding new functionality to the flows you want to change or to the application code you have written. Spark Streaming and Storm follow the same approach, where the new application code first has to be tested and then deployed, either in parallel with the existing one, or by first shutting down the existing application and starting the new one. This approach does not produce good results when real-time decision making is considered, and it does not allow tracing bugs as they occur in real-time; so there is some downtime when something in the application code needs to be changed. In NiFi, adding new functionality means adding/removing Processors or other components to/from the flow. Since it does not have save-and-deploy steps, the effect of what has been added or removed is seen in real-time in the UI. And if problems occur while making that change, they are traceable and can be solved right away. So NiFi is good for use cases where decision making, viewing the flow, and tracing bugs in real-time are vital.

4.4 How each tool handles the use case

In this section, a use case that is going to be used as a benchmark for the practical analysis is defined, and how each tool handles the use case is depicted briefly in theory. This theoretical study and comparison of the tools is important for choosing the tool that will be used for the analysis. There are two data sources from which data is ingested into the system. One is a real-time source, the Twitter API, for receiving tweets; the other is static data, retrieving historic tweets from the NoSQL database HBase. The two data sources are merged and filtered; then the incorrect data is sent to log files on local systems and the correct data to HBase. Finally, some of the filtered data is indexed into Apache Solr.

Figure 4.1: General use case flow

(I) Apache NiFi : NiFi handles this use case in a simple and efficient way through a friendly UI. It comes with processors that handle interactions between NiFi instances and other sources and systems, which for this use case are Twitter, HBase, and Solr. It does this by providing inbuilt processors to get the data, process it (extract, filter), and route it to other processors and downstream systems. It also has processors to write incorrect files to logs in local file systems or HDFS. Each processor used must be configured correctly to function appropriately, which enables monitoring and controlling the flow in real-time. It is also possible to use output ports, which allow sending flow files to external systems such as Apache Spark for further processing if needed.

Figure 4.1 above shows the general flow of the system for the use case. By setting properties for each processor and using the NiFi Expression Language, it is possible to design and query simple ETL transformations, splitting, merging, filtering, and extraction of needed information, which can further be used as input to other systems. NiFi uses the “GetTwitter” processor to receive the tweets from the Twitter API. Specific search terms can be set in the “Terms to Filter On” property of the

Figure 4.2: NiFi use case flow

processor. The “GetHBase” processor is used to retrieve historic tweets from HBase. These two processors send the flow files to a downstream processor which extracts the required fields from the JSON files they produce. For this, the “EvaluateJSONPath” processor is used, which allows defining custom properties that can later be used in making routing decisions. After the data is merged and the required fields are extracted, the “RouteOnAttribute” processor is used, which allows defining custom Boolean rules. These rules define whether the flow is correct or incorrect and are important in the next parts to route the flow accordingly. According to the rules, if the flow is correct, it is sent to HBase for storage, which is handled by the “PutHBaseJSON” processor, and some of the flow is also sent to Solr for indexing with the “PutSolrContentStream” processor. The incorrect data is sent to the “PutFile” processor, used as a log. This use case is shown using NiFi in Figure 4.2.

(II) Spark Streaming : Apache Spark Streaming is based on DStreams, which are small batches of RDDs over a specified time. Figure 4.1 is used as the general use case diagram. Initially, a Spark Context object is created from the configuration object. Then an SQL Context object is created, taking the Spark Context object as an argument; it is responsible for getting the queries from HBase and storing them in a temporary table, “tmpHBase”. Then a Streaming Context is created, taking the Spark Context and the sliding window interval as arguments. Spark Streaming receives inputs from the Twitter API using the Streaming Context, creating a DStream, “twitterDStream”. Then a Window operation is defined, which takes the sliding interval and the window length. A series of DStream transformations follows: splitting the tweets into separate words (“splitDStream”), filtering, which again creates other DStreams (“filterDStreams”), and mapping of the data, each time creating new DStreams with new transformations. After this, the last transformed DStream is stored as a temporary table, “tmpTwitter”, where it is joined/merged with the previous temporary table, “tmpHBase”, using the SQL Context created before, and it is continuously stored using the foreachRDD method. Finally, the “saveAsHadoopFile” or “saveAsNewAPIHadoopDataset” method is used to

store the data. The general illustration is shown in Figure 4.3.

Figure 4.3: Spark Streaming use case flow
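A minimal Scala sketch of this flow follows, assuming the spark-streaming-twitter module is on the classpath; the DStream names mirror the text, while the HBase loading and the final Hadoop output configuration are abbreviated as comments.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("TwitterUseCase")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)       // used for the tmpHBase/tmpTwitter join
val ssc = new StreamingContext(sc, Seconds(10))

// twitterDStream: live tweets from the Twitter API
val twitterDStream = TwitterUtils.createStream(ssc, None)

// splitDStream / filterDStreams: split tweets into words, drop empties
val filterDStreams = twitterDStream
  .map(_.getText)
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// Window operation over the stream
val windowed = filterDStreams.window(Seconds(60), Seconds(10))

windowed.foreachRDD { rdd =>
  // register the batch as tmpTwitter, join it with tmpHBase (loaded from
  // HBase via sqlContext beforehand), then persist the joined result with
  // saveAsHadoopFile / saveAsNewAPIHadoopDataset (configuration omitted)
}

ssc.start()
ssc.awaitTermination()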

(III) Apache Storm : Figure 4.1 is again used as the general use case diagram, and here it is described how Storm handles this use case. It starts by creating a Topology, where the Spouts and Bolts used are initialized. Then the Spouts are created, which are the entries where data is ingested. Two Spouts are created for receiving data from the two different sources, because the sources are different and the data arrives in different forms: data coming from the Twitter API and historic data from HBase, hence “twitterSpout” and “hbaseSpout”. In order to load data from HBase, an HBase connection first has to be created for use in Storm. The data is then sent to the Bolt that is subscribed to these Spouts, where merging of the data is handled, “mergeBolt”. Another Bolt, subscribed to the first Bolt, is created for processing the data (filtering and checking whether the text, language, and location fields are empty or not), “processBolt”. If one of the fields is empty, the data is considered incorrect and is sent to the Bolt that writes it to a log file in the local file system, “incorrectBolt”. If the fields are not empty, it is sent to the storage systems via “persistBolt”. This use case is illustrated using Storm in Figure 4.4.

Figure 4.4: Storm use case flow
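A hedged sketch of wiring this Topology with Storm's Java API (0.9.x packages) from Scala; the spout and bolt classes, and the "correct"/"incorrect" stream ids, are hypothetical stand-ins for the implementations described above.

import backtype.storm.{Config, StormSubmitter}
import backtype.storm.topology.TopologyBuilder

val builder = new TopologyBuilder

builder.setSpout("twitterSpout", new TwitterSpout)  // hypothetical spout class
builder.setSpout("hbaseSpout", new HBaseSpout)      // hypothetical spout class

// mergeBolt subscribes to both spouts
builder.setBolt("mergeBolt", new MergeBolt)
  .shuffleGrouping("twitterSpout")
  .shuffleGrouping("hbaseSpout")

// processBolt filters and checks the text, language, and location fields
builder.setBolt("processBolt", new ProcessBolt)
  .shuffleGrouping("mergeBolt")

// route to a logging bolt or a persisting bolt via separate output streams
builder.setBolt("incorrectBolt", new IncorrectBolt)
  .shuffleGrouping("processBolt", "incorrect")
builder.setBolt("persistBolt", new PersistBolt)
  .shuffleGrouping("processBolt", "correct")

StormSubmitter.submitTopology("twitter-topology", new Config, builder.createTopology())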

4.5 Summary

To summarize, in this chapter a thorough review of the tools was done based on some selected parameters, and their differences and similarities were then shown in a table summarizing how each of the tools handles those parameters. A discussion of the parameters followed, showing which tool is good for a particular use case. In the last part, how each tool handles the use case was briefly discussed. All of this is important to see the advantages and disadvantages of using one tool over another given parameters such as ease of use and integration with external systems such as HBase and Solr, which are used in this case. It helps in making a decision on the tool to use in the practical analysis, grounded in the study and comparison of the tools. Considering the use case defined, which involves ingestion of data, processing (merging, extraction, filtering, and routing) of data, and finally persisting it to storage systems with further analysis after indexing to Solr, the use case is not so complex: it does not require machine learning techniques or heavy computations and aggregations. In this regard, all of these tools are suitable for the use case defined. But NiFi has advantages over the others in that it provides an advanced web UI which enables designing, monitoring, and controlling the flow in real-time as the tweets keep flowing into the system. It also provides the provenance of the data, i.e. the origin of the data and what happens to each piece of data in real-time. It also has an advantage in integration with other systems such as HBase, Solr, Spark, and Storm. This is important if the data processed in NiFi needs to be transferred to other systems such as Spark for further advanced analysis; NiFi uses an inbuilt mechanism called the “Site-to-Site” protocol for this purpose. Generally, NiFi is the kind of tool to use for such use cases, where it enables easy, reliable, fast, and efficient processing of data which can then be transferred to other systems. So the tool that is going to be used for the analysis is NiFi, and further analysis is also made in Solr.

Chapter 5

Practical analysis/Twitter Data Analysis

This chapter covers the practical analysis of the project. In the previous chapters, different tools popular for processing real-time data were studied. In this chapter, a practical analysis of Twitter data is made using Apache NiFi, one of the tools studied in the previous chapters. The next sections introduce the problem definition and formulation, the questions it tries to answer, the setup for the analysis, and the main analysis parts.

5.1 Problem definition

The use case chosen for this purpose is the analysis of Twitter data in real-time. This use case is chosen to demonstrate how NiFi communicates with other data sources and systems such as Twitter, HBase, and Solr through its inbuilt processors. The use case is summarized as:

1. Source A - real-time tweets from Twitter

2. Source B - static data from HBase

3. Combine sources and extract required fields

4. Filter out incorrect data to log file - define a rule to filter correct/incorrect data

5. Write all data to an HBase table

6. Write some of the transactions to a Solr index - based on the rule defined

This use case tries to answer questions such as: how NiFi can be used with other systems; how the rules are formulated to extract, filter, and route flow files to their respective downstream connections; the top languages; the locations of the tweeters; the top tweeters; and so on. Apache NiFi is the main tool used for processing and analysis of the Twitter data, from ingesting the tweets to extracting useful fields, filtering, and making routing decisions based on defined properties. After the analysis, the data is persisted to HBase and

some tweets are also indexed using Apache Solr. Finally, Banana, a visualization tool working with Solr, is used to visualize the analysis in real-time.

5.2 Setup

The analysis is made on a Windows 10 machine with 8GB of memory, running Apache NiFi 0.6.0 in local mode, Apache Solr 5.5.0 in standard mode locally, and Banana 1.6 for visualizations. The project is also deployed on an Amazon Web Services (AWS) cluster on a CentOS machine running HDP 2.4 and HDF 1.2. HDP 2.4 is the Hortonworks Data Platform, consisting of the major Hadoop components, whereas HDF 1.2 is the Hortonworks Data Flow platform powered by Apache NiFi. HBase is part of HDP 2.4, and Apache Solr is installed separately in standard mode.

5.3 Analysis

In this section, the steps followed for the analysis of the tweet data are given. As in any other data processing framework, the process starts with data ingestion into the NiFi system. Then data processing is done, and finally the data is persisted to a storage database. Figure 5.1 shows this graphically.

Figure 5.1: Data Analysis flow

The data is ingested from two sources, i.e. real-time and static sources. The real-time data comes from Twitter, while the static source is historic tweets that were stored initially. Once the data is ingested into the flow, processing of the data continues. This includes extracting the required fields, since the tweets come with a whole lot of fields but only some of them are of interest. Then filtering and defining custom properties is done for making routing decisions. After all these steps, the data is persisted to storage systems and some of the data is also sent for indexing. The static data from the storage system is also fed back as an input source to the Data Ingestion step.

• Prerequisite : In order to use the needed processors, some prerequisites should be properly configured. For example, for using the GetHBase and PutHBaseJSON processors, the HBase_1_1_2_ClientService has to be configured in advance. The prerequisites are found in the Controller Services. A Controller Service is one of the important functions that NiFi provides. It bundles a whole lot of services to be configured and used repeatedly. Once the services are configured and set, NiFi allows using them repeatedly for many clients in the same instance without further configuration. It has many services, such as the DBCP Connection Pool, which once set can be used for many database connections. It also has the HBase_1_1_2_ClientService, where configuration files are specified; once set, it can be used by many clients running HBase for reading and writing data. In this project, the HBase_1_1_2_ClientService is used, and the path for the “hbase-site.xml” and “core-site.xml” files is specified for it to work properly. After this, the “GetHBase” and “PutHBaseJSON” processors can be used repeatedly in this instance because the common configuration is handled by the Controller Services.

5.3.1 Data Ingestion

The tweets are fetched from the Twitter API and then loaded into the NiFi flow through the “GetTwitter” processor. This processor has mandatory configuration properties that need to be set before starting the flow. The properties are shown in Table 5.1.

Property | Description
Twitter Endpoint | Specifies the Sample Endpoint or the Filter Endpoint. The Filter Endpoint has to be specified if terms to search for are given; otherwise the Sample Endpoint is used to get all public tweets.
Consumer Key and Consumer Secret | Provided by the Twitter API when creating the application
Access Token and Access Token Secret | Provided by the Twitter API when creating the application

Table 5.1: Mandatory properties for “GetTwitter” Processor

It also has other properties, such as “Terms to Filter On”, where terms to filter can be specified. For this project it was decided to filter based on the terms “IoT, InternetofThings, and BigData”. Once those properties are properly set, this processor is ready to start fetching data from the API. The other input data is from the NoSQL database HBase, where historic data is initially stored. For ingesting this static data from HBase, the “GetHBase” processor is used, an inbuilt processor for reading historic data from HBase. It uses the HBase_1_1_2_ClientService, which is set once and used many times by different HBase clients. The other mandatory property specifies the table name; in this project, the table name is “Twitter”.

5.3.2 Data Processing

This step has different parts, starting from extracting the required fields to filtering, separating the correct data from the incorrect based on defined rules, and finally routing the data based on the rules set.

• Extraction of Required Fields : As Twitter data is unstructured, consisting of a variety of types with many fields, it is not interesting to keep all of these fields for analysis. This step extracts only the required fields for further analysis. The Twitter data comes in JSON format, and NiFi provides an inbuilt processor called “EvaluateJSONPath” which is used to extract the required fields from the JSON data by allowing custom properties to be defined. The property names defined here are used when making routing decisions or in other processors. The fields of interest used for extracting the tweets are shown in Table 5.2.

Property Name | Twitter JSON field
twitter.id | $.id
twitter.user | $.user.name
twitter.handle | $.user.screen_name
twitter.createdAt | $.created_at
twitter.text | $.text
twitter.timestamp | $.timestamp_ms
twitter.hashtags | $.entities.hashtags[0].text - gets only the first hashtag
twitter.mentions | $.entities.user_mentions[0].name - gets only the first mention
twitter.lang | $.lang
twitter.location | $.user.location

Table 5.2: Custom properties for extracting tweets
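To illustrate what these JSONPath expressions select, here is a small stand-alone sketch using the Jayway JsonPath library, which uses the same path syntax; the tweet JSON is abbreviated and illustrative.

import com.jayway.jsonpath.JsonPath

val tweetJson = """{"id":1,"user":{"name":"Ada","screen_name":"ada","location":"Uppsala"},"text":"Hello #IoT","lang":"en","entities":{"hashtags":[{"text":"IoT"}]}}"""

val text: String = JsonPath.read(tweetJson, "$.text")                          // "Hello #IoT"
val handle: String = JsonPath.read(tweetJson, "$.user.screen_name")            // "ada"
val firstTag: String = JsonPath.read(tweetJson, "$.entities.hashtags[0].text") // "IoT"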

• Filtering and Routing of Data : All the tweets coming in at this step are filtered by the terms set in the previous steps: “IoT, InternetofThings, BigData”. Even though they come with the right terms, it is good to note that quite a few of the fields are empty, which makes no sense to consider. So further filtering is needed to remove those empty fields in order to make better routing decisions. Filtering is done in the context of Correct/Incorrect data in the “RouteOnAttribute” processor provided by NiFi. This processor allows defining custom rules, based on which routing decisions are made. The custom rules that decide whether data is Correct or Incorrect are:

(I) The Text, Hashtags, Mentions, Language, and Location fields extracted must not be empty.

(II) The tweets are routed to different downstream systems based on the rule set: English and Non-English tweets.

The rules use the fields extracted in the previous steps in the “EvaluateJSONPath” processor and are given names which are used as Connections for routing the data to different downstream systems. There is also another inbuilt rule which routes data when the custom rules are not satisfied. The rules are: route non-empty tweets that are English or Non-English; empty tweets that do not satisfy the custom rules are sent to the “Unmatched” relationship. The first rule in the NiFi Expression Language is:

Rule 1: ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.hashtags:isEmpty():not()}):and(${twitter.lang:isEmpty():not()}):and(${twitter.mentions:isEmpty():not()})}

Rule 2:
English - ${twitter.lang:equals("en")}
Non-English - ${twitter.lang:equals("en"):not()}

When combining the above rules, this gives:
English: Rule 1 + Rule 2 = ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en")})}

Non-English: Rule 1 + Rule 2 = ${twitter.text:isEmpty():not():and(${twitter.location:isEmpty():not()}):and(${twitter.lang:equals("en"):not()})}

“Unmatched” - one or more of the fields is empty.

(A) Correct Data: Correct data in this context is data that satisfies the rules set above, i.e. either English or Non-English tweets that are not empty. Here the correct data is routed based on its type to different destinations. If a tweet is Non-English, it is routed to Solr for further analysis, to show the different languages in which tweets were made, the locations, the top tweeters, and so on. All tweets that are correct (i.e. both non-empty English and Non-English) are persisted to HBase.

(B) Incorrect Data: Incorrect data has one or more of its fields empty. It is sent to log files through the “Unmatched” relationship. The “LogAttribute” processor is used here to simply log those flow files that do not satisfy the custom rules.

5.3.3 Data Storage

Twitter produces unstructured data in various formats, such as videos, images, plain text, and media files, and the data is produced at high speed in large volumes. Once the data is processed, it has to be persisted to storage systems. Traditional Relational Database Management Systems (RDBMS) cannot handle such data because of the volume and variety of data produced by such social media. So NoSQL databases

have to be used, which are capable of handling large volumes of unstructured data in an efficient way. In this regard, the NoSQL database HBase is chosen for this project. After the data is processed in the previous steps, the correct and incorrect data are differentiated according to the rules set and routed accordingly.

• Correct data : All processed correct data is persisted in Apache HBase. In order to store this data in Apache HBase, NiFi provides an inbuilt processor called “PutHBaseJSON” which writes data to the HBase database in JSON format. It has mandatory fields to be set before it is ready for use. The first property that needs to be set is the HBase_1_1_2_ClientService property, which is discussed in the section above. For this project, the table name and column family are “Twitter” and “tweets” respectively, created with the command:

hbase(main):002:0> create 'Twitter', 'tweets'
0 row(s) in 1.3130 seconds
=> Hbase::Table - Twitter

After the table is created, the other mandatory properties for the specific HBase client are specified in NiFi: “Twitter” as the table name, “tweets” as the column family, and “Id” as the Row Identifier Field Name, which comes from the JSON tweet Id field. After these properties are specified in the processor, it is ready to start.

5.3.4 Data Indexing & Visualization

In this step, some of the processed data is sent to Apache Solr for indexing and searching. The rules/properties that were set earlier are used here to send the data to Solr. In this regard, all the Non-English tweets that were processed are sent to Solr for indexing. There is no compelling reason to choose only these tweets; they are selected as an example to show how NiFi can be integrated with Apache Solr, and also the different language and location distributions, top tweets, and so on in Solr. For achieving this, the “PutSolrContentStream” processor is used. This processor has mandatory properties that need to be set, such as Solr Type and Solr Location. The Solr Type specifies either Standard mode or Cloud mode.

For this project, standard Solr mode is used. The Solr Location specifies the location where the Solr server is installed, which is “http://52.30.209.198:8983/solr/twitter”; “twitter” is the Core where all the tweets are stored and indexed. The processor also allows defining custom properties which transform the JSON document into a Solr document type, later used in Solr as attributes for further analysis. The properties defined are shown in Table 5.3.

Property | Solr field Name
f.1 | id:/id
f.2 | twitter_text_t:/text
f.3 | twitter_username_s:/user/name
f.4 | twitter_created_at_s:/created_at
f.5 | twitter_timestamp_ms_tl:/timestamp_ms
f.6 | twitter_screenname_s:/user/screen_name
f.7 | twitter_location_s:/user/location
f.8 | twitter_lang_t:/lang
f.9 | twitter_tag_ss:/entities/hashtags/text
f.10 | twitter_mentions_ss:/entities/user_mentions/name
f.11 | twitter_source_s:/source

Table 5.3: Custom properties for indexing tweets

The data that is processed and indexed this way has to be presented to the user visually to help make better decisions. It is also important to see which parameters are the most important to watch, so that the user is aware of what is happening. In this project, the indexed data is further fed to a visualization tool called “Banana” to showcase the different properties of the tweets in a dashboard in real-time. In the search field, terms such as IoT are searched, and hits for specific search terms, top tweeters, and the languages and locations are visualized using bars, histograms, and other components. After all the processors are properly configured and ready to start, the NiFi UI looks like Figure 5.2.

Figure 5.2: Overall NiFi Twitter Data Flow

5.3.5 Data Result & Discussion

After the flow is allowed to run, streams of tweets start flowing in real-time. Figure 5.3a shows an example of Non-English tweets that were extracted, with the text, language, and location fields all non-empty according to the rule, and routed accordingly to their respective downstream connections.

(a) Non English Tweets from Provenance data

(b) English Tweets from Provenance data

Figure 5.3: Both English and Non English Tweets from Provenance data

The attribute names are the fields defined to extract the tweets from the Twitter API in the “EvaluateJSONPath” processor. The figure shows the date each tweet was created, the languages, locations, the text, and also the usernames and screen names, which are shaded for privacy purposes. The “RouteOnAttribute.Route” field shows that it has a rule, “tweetsNonEnglish”, defined to route tweets that are not in English to downstream connections. Figure 5.3b shows the same information for a tweet whose “RouteOnAttribute.Route” field shows the rule “tweetsEnglish”, which routes English tweets to the next processor.

• Statistics : NiFi also allows viewing statistics for each processor flow along different parameters. The parameters include Average Task Duration, Bytes

Read in the last 5 minutes, Bytes Written in the last 5 minutes, Flow Files Out in the last 5 minutes, and so on. Figure 5.4a below shows the Average Task Duration for the “PutSolrContentStream” processor.

(a) Average Task Duration status for Indexing

(b) Status for flow files out from the processor

Figure 5.4: Statistics data

The left side of Figure 5.4a shows the name and type of the processor as well as the start and end time over which the average task duration is shown. The last item shows the Min/Max/Mean time it takes to process the flows or send them to Apache Solr. From the graph, the peak average task duration [00:00:00.076] is between 21:10 and 21:15. The next highest, a little less than [00:00:00.040], is around 22:30. Figure 5.4b shows the statistics for the “GetTwitter” processor, which is responsible for getting the tweets from the Twitter API. It receives the tweets and sends them out to the next processor or downstream connection for further processing. This statistic shows the Flow Files transferred out in the last 5 minutes. The left side shows the start and end time and also the Min/Max/Mean

of the number of files transferred. It is shown that the maximum number of flow files transferred is 12, and the peak time is around 11:25.

5.3.6 Data Analysis in Solr

Further analysis is also made in Apache Solr, and its results are visualized using Banana. The analysis searches for specific terms in the tweets and returns the hits for each term, the languages, and the locations of the tweeters.

• Specific terms in a tweet : This analysis defines the terms to search for in Solr and returns the number of hits for terms such as “iot, bigdata, internetofthings”, only in Non-English tweets, which were used as filters in NiFi in the previous sections. The total number of indexed tweets is 789, and from these the searches for the individual terms internetofthings, bigdata, and iot return 7, 25, and 30 hits respectively. The search for all the terms together returns 52.

The query used is:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.5: Filter for specific terms, “iot,bigdata,internetofthings”

Figure 5.5 also shows that the filtering is done with the query “-twitter_lang_t:en”, which loads only Non-English tweets.

• The language distribution : This analysis gives the different languages in which the tweets were made. When the terms iot, internetofthings, and bigdata are searched for specific languages, it returns 24, 10, 4, and 2 hits for French, Spanish, German, and Japanese respectively.

Figure 5.6: Language distribution

The query used for non-English languages is:
http://52.18.85.201:8983/solr/twitter/select?q=-twitter_lang_t:en+and+twitter_text_t:internetofthings+or+twitter_text_t:bigdata+or+twitter_text_t:iot&wt=json&indent=true

Figure 5.6 shows the language distribution in a pie chart, where 52% is French, 22% Spanish, and so on. It also shows a world map where the parts of the world where tweets were made are shaded. Next to it, the top tweeters for the specific search terms used above are shown.

• The locations of the tweeters : This part of the analysis concerns the locations of the tweeters, to show which parts of the globe are tweeting about certain terms such as iot, bigdata, and internetofthings. The locations may sometimes be at country level, such as Sweden or Canada, or they may be particular locations without a country level.

Figure 5.7: Location distribution

Figure 5.7 above shows the different locations and their occurrences when the above terms are searched from the Banana User Interface. It also shows the hashtags in a TagCloud panel, and next to it the top mentions for the searched terms are shown in a pie chart. This visualization helps the user search for specific terms across all the indexed tweets and draw conclusions from the different properties/characteristics displayed in the dashboard.

Chapter 6

Evaluation

This chapter discusses the performance evaluation and how the designed data flow can be optimized. NiFi's performance is affected by many factors, such as the type and number of processors used, which affects the system resources (CPU, RAM, ...); whether the flow is allowed to run without constraints, producing large backlogs (i.e. whether a back pressure mechanism is applied or not); and whether clustering is used or not.

• The type and number of processors used : The type of processors used determines how many resources are allocated to a particular processor, and this differs from processor to processor, because some processors are resource-intensive and require more by default. The number of processors used also has an impact on performance, because every processor needs a thread allocated to it by the Flow Controller to function properly. In this regard, grouping processors with the same functionality helps to minimize the threads that are used and makes them available for other processes.

Figure 6.1: Same Processors used repeatedly

Figure 6.1 shows the same processors being used repeatedly to extract and route data once it is ingested from the “GetTwitter” and “GetHBase” processors. This works against the performance of NiFi because it uses more threads for the same processors. Grouping processors with the same functionality together is therefore a good design choice and enhances performance. Figure 6.1 can thus be condensed by grouping the same processors together so that they use only single threads for their execution, as shown in Figure 6.2.

Figure 6.2: Same processors used once for performance gain

Figure 6.2 above shows that the data from the “GetTwitter” and “GetHBase” processors is all sent to one processor, i.e. “Extract Fields - EvaluateJSONPath”, and likewise downstream.

• Back pressure mechanism : If the flow files are allowed to run continuously without any constraint, this has a big impact on the performance of NiFi and the system as a whole. The impact could be due to repositories not being updated properly, or a full disk because of the amount of data flowing in. To solve this problem, NiFi provides back pressure mechanisms to be deployed on connections to downstream systems, by setting a threshold on the number of flow files to be processed or on the size of the files, so that data is allowed to flow until this threshold is reached, as shown in Figure 6.3.

Figure 6.3: Setting back pressure for the connection

The downstream processor then only accepts flows up to the threshold, and once the backlog is below the threshold it processes the queued data. This prevents NiFi and the processor itself from being overwhelmed. The back pressure can be set on every connection created between the processors, or by using another processor called “ControlRate” only once; all downstream processors then get the flows at the rate set in this processor. In the Twitter analysis, it is possible either to set the back pressure on every connection or to use the “ControlRate” processor. Figure 6.4 shows the “ControlRate” processor being used to set the threshold to 100KB. This means that it will process data up to 100KB; if more flow files come in, they are queued, and when the backlog drops below the threshold, the queued data is processed.

Figure 6.4: Using the “ControlRate” processor to control the rate of flow to downstream processors
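Reduced to its configuration, the “ControlRate” setup behind Figure 6.4 is a handful of processor properties. The sketch below uses the property names as they appear in the ControlRate processor’s documentation (worth verifying against the NiFi version in use); the 100 KB value mirrors this flow, while the one-minute window is an assumed example:

    Rate Control Criteria : data rate   (throttle on data size, not flow file count)
    Maximum Rate          : 100 KB      (at most this much data passes per time window)
    Time Duration         : 1 min       (assumed window over which the rate is enforced)

Setting “Rate Control Criteria” to “flowfile count” instead would throttle on the number of flow files rather than their size.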

• Clustering: If more data flows into the system than the available system resources can handle, adding resources through clustering can be used for performance gain. NiFi works in a master/slave architecture where the master checks the load of every node (slave) in order to assign work; after calculating the load balance across the nodes, it assigns work to the respective node. Nodes with more resources can be added as needed to distribute the work across the cluster, which in turn results in a performance gain. Clustering is thus also one way of addressing performance problems.
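In the NiFi version used in this project, a cluster is enabled through nifi.properties on the master (the NiFi Cluster Manager, NCM) and on each node. A minimal sketch, assuming the property names from the NiFi 0.x administration guide; host names and ports are placeholders:

    # on the NiFi Cluster Manager (the master)
    nifi.cluster.is.manager=true
    nifi.cluster.manager.address=ncm-host
    nifi.cluster.manager.protocol.port=9001

    # on every node (the slaves)
    nifi.cluster.is.node=true
    nifi.cluster.node.address=node-host
    nifi.cluster.node.protocol.port=9002
    nifi.cluster.node.unicast.manager.address=ncm-host
    nifi.cluster.node.unicast.manager.protocol.port=9001

Every node runs the same flow on its own share of the data, so adding nodes spreads the load.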

Chapter 7

Conclusion and Future Work

This thesis project investigates the handling of streaming data. It is divided into two parts, theoretical and practical. In the theoretical part, IoT is studied, together with an overview of tools such as Apache NiFi, Apache Spark Streaming, and Apache Storm. The project further defines the parameters Ease of Use, Security, Reliability, Queued data/Buffering, and Extensibility to review the behavior of the studied tools. This approach makes the results usable as a guide for choosing one of the tools in the future.

From the study of the tools, it is found that Apache NiFi is a data processing tool with features such as a user-friendly web UI, built-in security, fault tolerance, provenance and lineage, extensibility, clustering, and more. It is highly suitable for IoT applications because its extensibility allows designing custom processors capable of ingesting data into NiFi in the required formats. It is also found that Spark Streaming is a fast processing framework, because it uses in-memory computation and divides the incoming data into small batches, which reduces latency and speeds up computation, while Apache Storm processes the data without breaking it into chunks, computing tuples as they arrive. Both Spark Streaming and Apache Storm can be used for simple data processing such as ETL operations as well as for more complex computations requiring MLlib algorithms, heavy computation, and aggregations, whereas NiFi is used for simple data processing such as ETL, routing, data mediation, and similar operations.

Finally, in the practical part, Apache NiFi is used to process Twitter data and examine tweets matching certain terms such as “iot, bigdata, internetofthings”. Further analysis is made of the number of hits for these terms and the location and language distribution of the tweets in Apache Solr, with the results visualized in the Banana framework. This shows that the platform/tool chosen for the practical analysis, i.e., Apache NiFi, is suitable for such use cases and can be used efficiently for data processing and analysis. It also shows that NiFi can easily be integrated with external systems such as Apache HBase, Apache Solr, and others.
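To make the extensibility point concrete: a custom NiFi processor is a Java class built on NiFi’s processor API. The minimal sketch below, which tags incoming IoT flow files with a source attribute, is a hypothetical example rather than a processor used in this project; the API types (AbstractProcessor, ProcessSession, Relationship) are NiFi’s, while the class name, attribute key, and value are invented for illustration:

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.annotation.documentation.CapabilityDescription;
    import org.apache.nifi.annotation.documentation.Tags;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    @Tags({"iot", "example"})
    @CapabilityDescription("Hypothetical processor that tags incoming IoT flow files with a source attribute.")
    public class TagIotSourceProcessor extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Flow files tagged successfully")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            // Take one flow file from the incoming queue; yield if none is waiting.
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Attach a custom attribute (hypothetical key/value) and route downstream.
            flowFile = session.putAttribute(flowFile, "iot.source", "sensor-gateway");
            session.transfer(flowFile, REL_SUCCESS);
        }
    }

Packaged as a NAR archive and placed in NiFi’s lib directory, such a processor appears on the canvas next to the built-in ones.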

7.1 Future Work

In the theoretical part of the thesis, the work done can be extended in various ways. It can be extended to include an overview of more frameworks in the stream data processing area. The study on IoT can be further extended to cover the challenges and solutions in more detail. More parameters can be added to compare and contrast the tools. And since the project focuses mainly on stream data processing and analysis, which the studied tools comply with, it could be extended to include an overview of storage systems and search or indexing platforms.

The practical part, i.e., the Twitter data analysis, could also be extended with more features. Only some of the tweet fields were extracted and analyzed here, but the analysis can be extended to include further fields. Different routing rules than the ones used can be set, and more of them. One interesting extension point would be to send the processed data from NiFi to external systems such as Apache Spark for more complex computations on the tweets. And since the Twitter analysis was done as a benchmark to show how NiFi can be used for such cases, this work can be further extended to process data from other types of sources, such as geospatial, sensor, or other IoT data.

List of Figures

3.1 NiFi standalone Architecture - source [8] ...... 13
3.2 NiFi Cluster Architecture - source [8] ...... 14
3.3 NiFi UI canvas ...... 17
3.4 NiFi main components ...... 18
3.5 NiFi Processor Anatomy ...... 19
3.6 NiFi Provenance ...... 21
3.7 NiFi Lineage ...... 21
3.8 Continuous RDDs form DStream - source [33] ...... 23
3.9 Spark Cluster - source [33] ...... 24
3.10 Storm Topology - source [38] ...... 26
3.11 Storm Cluster - source [38] ...... 27

4.1 General use case flow ...... 35
4.2 NiFi use case flow ...... 36
4.3 Spark Streaming use case flow ...... 37
4.4 Storm use case flow ...... 38

5.1 Data Analysis flow ...... 40
5.2 Overall NiFi Twitter Data Flow ...... 45
5.3 Both English and Non-English Tweets from Provenance data ...... 46
5.4 Statistics data ...... 47
5.5 Filter for specific terms, “iot, bigdata, internetofthings” ...... 48
5.6 Language distribution ...... 49
5.7 Location distribution ...... 49

6.1 Same Processors used repeatedly ...... 50
6.2 Same processors used once for performance gain ...... 51
6.3 Setting back pressure for the connection ...... 52
6.4 Using the “ControlRate” processor to control the rate of flow to downstream processors ...... 52

List of Tables

3.1 Storm Architecture Components Functionality ...... 27

4.1 Differences and Similarity of the tools ...... 32

5.1 Mandatory properties for “GetTwitter” Processor ...... 41 5.2 Custom properties for extracting tweets ...... 42 5.3 Custom properties for indexing tweets ...... 45

Acronyms & Abbreviations

The acronyms used in this report are outlined in the table below.

Acronym    Description
ASF        Apache Software Foundation
API        Application Program Interface
CPU        Central Processing Unit
CSV        Comma Separated Value
DStreams   Discretized Streams
DDoS       Distributed Denial of Service
ETL        Extract Transform Load
FTP        File Transfer Protocol
HDFS       Hadoop Distributed File System
HVAC       Heating Ventilation Air Conditioning
HDF        Hortonworks Data Flow
HDP        Hortonworks Data Platform
H2H        Human-to-Human
H2T        Human-to-Things
HTML       Hyper Text Markup Language
ICT        Information Communication Technology
ITU        International Telecommunication Union
IoT        Internet of Things
JSON       JavaScript Object Notation
JVM        Java Virtual Machine
MIT        Massachusetts Institute of Technology
NSA        National Security Agency
NCM        NiFi Cluster Manager
OS         Operating System
QoS        Quality of Service
RFID       Radio Frequency Identification
RPG        Remote Process Group
RDD        Resilient Distributed Dataset
SSL        Secure Sockets Layer
S3         Simple Storage Service
T2T        Things-to-Things
TLP        Top Level Project
TCP/IP     Transmission Control Protocol/Internet Protocol
URL        Uniform Resource Locator
UI         User Interface
WSN        Wireless Sensor Network
WAL        Write Ahead Logging
XML        Extensible Markup Language

Bibliography

[1] “Ericsson IoT”. url: http://www.ericsson.com/thecompany/our_publications/books/internet-of-things (visited on 02/11/2016).
[2] D. Miorandi et al. “Internet of things: vision, applications and research challenges”. In: Ad Hoc Networks vol. 10, no. 7 (2012), pp. 1497–1516.
[3] Dave Evans. “The Internet of Things: How the Next Evolution of the Internet Is Changing Everything (white paper)”. Tech. rep. April 2011.
[4] James Manyika et al. “The Internet of Things: Mapping the Value Beyond the Hype”. In: McKinsey Global Institute (June 2015), p. 3.
[5] David Niewolny. “How the Internet of Things Is Revolutionizing Healthcare (white paper)”. Tech. rep. October 2013.
[6] R. Weber. “Internet of Things: New security and privacy challenges”. In: Computer Law and Security Review vol. 26, no. 1 (2010), pp. 23–30.
[7] M. Zaharia et al. “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters”. In: (2012).
[8] “NiFi Overview”. url: https://nifi.apache.org/docs.html (visited on 02/02/2016).
[9] “Series Y: Global Information Infrastructure, Internet Protocol Aspects and Next-Generation Networks, Next Generation Networks – Frameworks and functional architecture models (white paper)”. Tech. rep. 2012.
[10] K. Ashton. “That “Internet of Things” Thing”. In: RFID Journal (2009).
[11] Somayya Madakam, R. Ramaswamy, and Siddharth Tripathi. “Internet of Things (IoT): A Literature Review”. In: Journal of Computer and Communications vol. 3 (2015), pp. 164–173.
[12] Jayavardhana Gubbi et al. “Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions”. In: ().
[13] Sean Dieter et al. “Towards Implementation of IoT for environmental condition monitoring in homes”. In: IEEE Sensors Journal vol. 13, no. 10 (Oct 2013).
[14] “Apple HomeKit”. url: http://www.apple.com/ios/homekit/ (visited on 02/13/2016).
[15] Pedro Castillejo et al. “An Internet of Things Approach for Managing Smart Services Provided by Wearable Devices”. In: International Journal of Distributed Sensor Networks (2013).

[16] Melanie Swan. “Sensor Mania! The IoT, Wearable Computing, Objective Metrics and Quantified Self 2.0”. In: Journal of Sensor and Actuator Networks (2012).
[17] Andrea Zanella et al. “Internet of Things for smart cities”. In: IEEE Internet of Things Journal vol. 1, no. 1 (2014), pp. 22–31.
[18] Ji chun Zhao et al. “The study and application of the IoT technology in Agriculture”. In: (2010).
[19] “IoT in Agriculture Case Study, Thingworx”. url: http://www.thingworx.com/Markets/Smart-Agriculture (visited on 02/06/2016).
[20] Debasis Bandyopadhyay and Jaydip Sen. “Internet of Things - Applications and Challenges in Technology and Standardization”. In: (2011).
[21] Krushang Soner and Hardik Upadhyay. “A survey: DDoS Attack on Internet of Things”. In: International Journal of Engineering Research and Development vol. 10, no. 11 (Nov 2014), pp. 58–63.
[22] J. H. Ziegeldorf, O. Garcia Morchon, and K. Wehrle. “Privacy in the Internet of Things: threats and challenges”. In: Security and Communication Networks vol. 7, no. 12 (2014), pp. 2728–2741.
[23] Bugra Gedik and Ling Liu. “Protecting Location Privacy with Personalized K-Anonymity: Architecture and Algorithms”. In: IEEE Transactions on Mobile Computing vol. 7, no. 1 (2008).
[24] “Privacy by Design in Big Data”. In: (Dec 2015).
[25] “NSA NiFi”. url: https://www.nsa.gov/public_info/press_room/2014/nifi_announcement.html (visited on 03/12/2016).
[26] “NiFi Key Features”. url: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.1.0/bk_Overview/content/high-level-overview-of-key-nifi-features.html (visited on 03/13/2016).
[27] “Apache NiFi wiki”. url: https://cwiki.apache.org/confluence/display/NIFI/Apache+NiFi (visited on 03/18/2016).
[28] “Spark Overview”. url: http://www.spark.apache.org (visited on 03/20/2016).
[29] “Spark AMPLab”. url: https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/ (visited on 03/20/2016).
[30] “Spark SQL Module”. url: http://www.spark.apache.org/sql/ (visited on 03/20/2016).
[31] “Spark GraphX Module”. url: http://www.spark.apache.org/graphx/ (visited on 03/20/2016).
[32] “Spark Machine Learning Module”. url: http://www.spark.apache.org/mllib/ (visited on 03/20/2016).
[33] “Spark Streaming Module”. url: http://www.spark.apache.org/streaming/ (visited on 03/20/2016).
[34] “Spark Programming Guide”. url: http://spark.apache.org/docs/1.6.0/programming-guide.html (visited on 04/02/2016).

[35] “Apache Storm”. url: http://storm.apache.org/index.html (visited on 04/02/2016).
[36] “Apache Storm history”. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html (visited on 04/05/2016).
[37] “Storm Feature”. url: http://hortonworks.com/hadoop/storm/ (visited on 04/02/2016).
[38] “Storm Tutorial”. url: http://storm.apache.org/releases/0.9.6/ (visited on 04/02/2016).
[39] “Spark Security”. url: http://spark.apache.org/docs/1.6.0/security.html (visited on 05/25/2016).
[40] “Storm Thrift API”. url: http://thrift.apache.org/docs/features (visited on 05/28/2016).

Appendix - Apache License, 2.0.

The material content of this thesis project is licensed under the Apache License, 2.0.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
