Evaluation of Distributed Stream Processing Frameworks for Iot Applications in Smart Cities
Total Page:16
File Type:pdf, Size:1020Kb
Nasiri et al. J Big Data (2019) 6:52 https://doi.org/10.1186/s40537-019-0215-2 RESEARCH Open Access Evaluation of distributed stream processing frameworks for IoT applications in Smart Cities Hamid Nasiri* , Saeed Nasehi and Maziar Goudarzi *Correspondence: [email protected] Abstract Department of Computer The widespread growth of Big Data and the evolution of Internet of Things (IoT) Engineering, Sharif University of Technology, Azadi Avenue, technologies enable cities to obtain valuable intelligence from a large amount of Tehran, Iran real-time produced data. In a Smart City, various IoT devices generate streams of data continuously which need to be analyzed within a short period of time; using some Big Data technique. Distributed stream processing frameworks (DSPFs) have the capac- ity to handle real-time data processing for Smart Cities. In this paper, we examine the applicability of employing distributed stream processing frameworks at the data processing layer of Smart City and appraising the current state of their adoption and maturity among the IoT applications. Our experiments focus on evaluating the perfor- mance of three DSPFs, namely Apache Storm, Apache Spark Streaming, and Apache Flink. According to our obtained results, choosing a proper framework at the data analytics layer of a Smart City requires enough knowledge about the characteristics of target applications. Finally, we conclude each of the frameworks studied here have their advantages and disadvantages. Our experiments show Storm and Flink have very similar performance, and Spark Streaming, has much higher latency, while it provides higher throughput. Keywords: Distributed stream processing, Smart City, IoT applications, Latency, Throughput Introduction Large number of embedded devices, massive volumes of data, users and applications are driving the digital world to move faster than ever. To be competitive in today’s digi- tal economy companies have to process large volumes of dynamically changing data at real-time. Tere are many industries from health-care, e-commerce, insurance and tel- ecommunications with various use cases such as DNA sequencing, capturing customer insights, real-time ofers, high-frequency trading, and real-time intrusion detection that have taken the use of Big Data analytics into account to make critical decisions that impact their business [1]. On the other hand, the Internet of Tings (IoT) is becoming the primary grounds for data mining and Big Data analytics [2]. With the rapid growth of IoT and its use cases in diferent domains such as Smart City, Mobile e-Health and Smart Grid, streaming applications are driving a new wave of data revolutions. In most IoT appli- cations the resulting analytics give some feedbacks to the system to improve it [3]. Compared to the other Big Data domains, there is a low-latency cycle between system © The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Nasiri et al. J Big Data (2019) 6:52 Page 2 of 24 responses which makes it necessary to process events in real-time, to derive accept- able responsiveness. In all of these domains, one of the most fundamental challenge is to explore the large volumes of data and extract useful information for future actions. In particular, this real-time exploration has to be done at massive scales. Nowadays generated data in IoT era have several characteristics that put them in the class of Big Data [4]. Specially, in a Smart City generated data normally has the following characteristics: • Large volumes of data the amount of real-time generated data by diferent applica- tions in a Smart City can be in order of zetabytes. • Heterogeneous data sources in Smart Cities, the data sources are diverse. For example, there are many sensors data, RFID data, cameras data, human generated data, and so on. • Heterogeneous data types collected data by diferent devices are diferent in for- mat, packet size, required precision and arrival time. Many distributed Big Data platforms are designed to provide scalable processing on commodity clusters. Apache Hadoop [5] is one of the most popular frameworks for batch processing, which uses MapReduce [6] programming model. Tere are several large-scale computing architectures customized for batch processing [7]; however, they are not suitable for stream processing; because in the MapReduce paradigm all input data need to be stored on a distributed fle system (like HDFS) before start to process. To address the large-scale real-time processing problem, some distributed frameworks such as Apache Storm [8], Spark Streaming [9] and Apache Flink [10] have been emerged. Tey process a continuous stream of messages on distributed resources with low-latency and high throughput. Tese recent Big Data platforms are becoming one of the most essential components of IoT ecosystems. Smart City It is expected to about 68% of the population to live in cities by 2050 [11]. On the other hand, using technology to improve citizen’s lives makes a city into a smart one. In other words, Smart Cities ofer their citizens a high quality of life by improving processes and reducing environmental footprint with technology [12]. Tese cities capture data using sensors, and use Big Data analytics techniques to improve environmental, fnancial, and social aspects of urban life. According to [13] Smart City technology, spending reached $80 billion in 2016, and is predicted to grow to $135 billion by 2021. An intelligent combination of technological advances such as IoT, Cloud Comput- ing, Smart Grid and Smart Building allows tracking huge amounts of information; this combination creates an intelligent system known as Smart City. A Smart City uses IoT sensors and many other technologies to connect components through a city to derive data and improve the lives of citizens and visitors. For example, in a Smart City, buildings have interconnected electrical, cooling, and heating systems moni- tored by an intelligent management system and a Smart Grid can be used to control on-demand generation of electricity. Nasiri et al. J Big Data (2019) 6:52 Page 3 of 24 System architecture for Smart City Te backbone of a Smart City is Internet of Tings and the IoT itself consists of dif- ferent areas such as embedded systems, ubiquitous computing and Big Data analytics. Normally, the system architecture for IoT has fve layers including: Devices, Gateways, Access network, Data processing and Application [4]. So the system structure that inte- grates Big Data into Smart City [12] can be considered as Fig. 1. Layers in this architec- ture are similar to IoT system architecture layers and can be divided as follow: 1. Device layer: Set of sensors, RFID, cameras, and other devices that continuously capture physical measures such as temperature, humidity, light, vibration, location, movement, etc. 2. Data collection layer: Continuous streams of unstructured data which are being cap- tured by device layer, will be transferred to the data nodes. In the data collection layer, all of the collected data by the data nodes will be aggregated. Aggregation of these raw data allows ofering usable services. 3. Data processing layer: In this layer, the aggregated data are processed using batch or stream distributed processing engines such as Hadoop [14], Spark [15], Storm [16], and Flink [17]. Intelligent Traffic Web Display Applicaon Recommender Fraud Detecon service Layer Management Adversing Smart Analyzer Programming model for processing large sets with a parallel, distributed algorithms Data Processing Layer Distributed fault tolerance database for large unstructured data sets like NOSQL Data Collecon Layer Source Of Big Data Device Layer Fig. 1 System architecture of Smart Cities Nasiri et al. J Big Data (2019) 6:52 Page 4 of 24 4. Application service layer: Here many applications such as intelligent trafc manage- ment, water and electricity monitoring, disaster discovery, fraud detection and web display analysis are provided. In this layer, people and machines directly interact with each other. In the third layer of this architecture the main requirements for Big Data processing are needed. So, choosing efcient Big Data tools and frameworks has important role to utilize the processing resources. Te goal of this paper is to examine applicability of employing distributed stream processing frameworks at the data processing layer of Smart Cities and appraising the current state of their adoption and maturity among the IoT applications. Regardless to where the data processing is happening (edge or center of processing layer) our experi- mental results can be taken into account to determine the best framework depending on the processing requirements of streaming applications. Te main contribution of our work is providing a fair comparison among three well-known DSPFs in terms of perfor- mance, throughput, and scalability. In comparison with our previously published conference paper [18], we have added complementary information to “Programming model of stream processing” section and studied more frameworks in “Distributed stream processing frameworks” section. Te discussed frameworks are also explained with more details and some fgures are used to better show their architectures. A new section is added to cover the prior works that evaluate stream processing systems from diferent aspects. Moreover, to show the pow- erful cooperation of DSPFs and message brokers (which makes the real-time streaming processing of Big Data possible) we have added some information about the integration of DSPFs with message brokers. Te evaluations are extended and we have also taken into account the resource uti- lization (including CPU and network usage) in our experiments. Another real-world application “Training application” is chosen among several benchmark application and is developed for all intended frameworks.