Lambda Architecture for Distributed Stream Processing in the Fog

DIPLOMARBEIT

zur Erlangung des akademischen Grades

Diplom-Ingenieur

im Rahmen des Studiums

Software Engineering & Internet Computing

eingereicht von

Matthias Schrabauer, BSc Matrikelnummer 01326214

an der Fakultät für Informatik der Technischen Universität Wien

Betreuung: Associate Prof. Dr.-Ing. Stefan Schulte

Wien, 2. Februar 2021 Matthias Schrabauer Stefan Schulte

Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.at

Lambda Architecture for Distributed Stream Processing in the Fog

DIPLOMA THESIS

submitted in partial fulfillment of the requirements for the degree of

Diplom-Ingenieur

in

Software Engineering & Internet Computing

by

Matthias Schrabauer, BSc Registration Number 01326214

to the Faculty of Informatics at the TU Wien

Advisor: Associate Prof. Dr.-Ing. Stefan Schulte

Vienna, 2nd February, 2021 Matthias Schrabauer Stefan Schulte

Technische Universität Wien, A-1040 Wien, Karlsplatz 13, Tel. +43-1-58801-0, www.tuwien.at

Erklärung zur Verfassung der Arbeit

Matthias Schrabauer, BSc

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 2. Februar 2021 Matthias Schrabauer


Danksagung

An dieser Stelle möchte ich mich bei allen Personen bedanken, die mich während der Erstellung dieser Arbeit unterstützt und motiviert haben. Besonders möchte ich mich bei Herrn Associate Prof. Dr.-Ing. Stefan Schulte bedanken, der diese Arbeit betreut und begutachtet hat. Für die zahlreichen, hilfreichen Anregungen und die konstruktive Kritik bei der Erstellung dieser Arbeit möchte ich mich herzlich bedanken.

Ebenfalls möchte ich mich bei meinen Mitstudenten und Mitstudentinnen bedanken, die mir immer hilfsbereit zur Seite standen, wenn ich eine zweite Meinung benötigte oder ich technische Details diskutieren wollte.

Abschließend möchte ich mich bei meinen Eltern bedanken, die mir durch ihre Unterstützung mein Studium erst ermöglicht haben.


Acknowledgements

I want to use this opportunity to thank all the people who supported and motivated me during the writing of this work. Especially, I want to thank my advisor Associate Prof. Dr.-Ing. Stefan Schulte, who supervised and reviewed this thesis. I am especially thankful for his numerous helpful suggestions and the constructive criticism he offered me throughout the writing of this thesis.

I would also like to thank my fellow students, who have always been helpful when I needed a second opinion or someone to discuss technical details with.

Finally, I would like to thank my parents, whose support made my studies possible in the first place.


Kurzfassung

Der digitale Wandel führt zu einem stetig wachsenden Datenaufkommen. Mit dem Wachstum von “Big Data” steigt der Bedarf, diese großen Datenmengen zu analysieren und nutzbringend zu verwenden (bezeichnet als Stapelverarbeitung). Dazu haben sich Programmiermodelle, Frameworks, Plattformen und Tools wie das Apache-Hadoop-Ökosystem und das MapReduce-Programmiermodell etabliert. Solche Systeme sind für die Stapelverarbeitung konzipiert und eignen sich daher nicht für die Verarbeitung von Datenströmen. Mit dem Aufkommen von Anwendungsszenarien wie Smart Cities und autonomem Fahren steigt der Bedarf, kontinuierliche Datenströme in Echtzeit zu verarbeiten. Zu diesem Zweck werden Frameworks zur Datenstromverarbeitung wie Apache Storm oder Apache Flink eingesetzt. Allerdings arbeiten solche Frameworks in der Regel in der Cloud innerhalb eines lokalen Clusters mit geringer Latenz. Für Internet of Things (IoT) Anwendungen führt dieser zentralisierte Ansatz oft zu hohen Latenzen, da Datenströme (z.B. Sensordaten) erst in die Cloud geschickt werden müssen, um sie zu verarbeiten. Um dieses Problem zu adressieren und IoT-Daten effizient zu verarbeiten, reicht die Cloud allein nicht mehr aus. Es gibt einen zunehmenden Trend, die Verarbeitung von Daten näher an den Rand des Netzwerks zu verlagern, wo die Daten erzeugt und gespeichert werden. Um sowohl die Vorteile der Stapel- als auch der Datenstromverarbeitung zu nutzen, wurde die Lambda-Architektur eingeführt. Diese Architektur basiert auf drei Schichten, die es ermöglichen, große Mengen an historischen Daten effizient zu verarbeiten (“Batch-Schicht” und “Serving-Schicht”) und gleichzeitig kontinuierliche Datenströme in Echtzeit zu prozessieren (“Speed-Schicht”).

Ziel dieser Arbeit ist es, einen Lösungsansatz zu entwerfen und zu implementieren, der sowohl die Lambda-Architektur als auch Fog-Computing nutzt, um Datenströme in Echtzeit zu verarbeiten.

Die Evaluierung konzentriert sich darauf, wie gut Fog-basierte Datenstromverarbeitungs-Topologien im Vergleich zu einem traditionellen Cloud-Ansatz abschneiden. Für eine quantitative Bewertung werden gängige Metriken aus dem Bereich der Datenverarbeitung verwendet (Latenz, Round-Trip-Zeit von Datenpaketen). Die Evaluierung des Lösungsansatzes zeigt, dass der Einsatz von verteilter Datenstromverarbeitung in der Fog eine vielversprechende Alternative zur traditionellen Datenverarbeitung in der Cloud sein kann. Insgesamt zeigen die Ergebnisse eine Verringerung der Round-Trip-Zeiten, insbesondere wenn die Latenz zur Cloud über 50 ms liegt oder die Datenpaketgröße recht groß ist.


Abstract

The digital transformation is leading to a constantly increasing volume of data. With the growth of big data, there is a rising demand for analyzing and making use of those large piles of data (referred to as batch processing). To do that, programming models, frameworks, platforms, and tools such as the Apache Hadoop ecosystem and the MapReduce programming model have been established. Such systems have been designed for batch processing and are therefore not suitable for (real-time) stream processing. With application scenarios like smart cities and autonomous driving emerging, there is a growing need to process continuous streams of data close to real-time. For this purpose, distributed stream processing frameworks such as Apache Storm or Apache Flink are used to analyze data streams. However, such frameworks usually operate in the cloud within a local cluster with low latency. For Internet of Things (IoT) applications, this centralized approach often leads to high latency, since data streams (e.g., sensor data) must be sent to the cloud first in order to process them. To address this issue and to efficiently process IoT data on a large scale, the cloud alone is no longer sufficient. There is an increasing trend to push the processing of data closer to the edge of the network, where the data is generated and stored. In order to take advantage of both batch and stream processing, the lambda architecture design pattern has been introduced. This architectural style is based on three layers, which allow massive volumes of historical data to be processed efficiently (batch and serving layer) while simultaneously using stream processing to provide a real-time analysis of continuous data streams (speed layer).

The goal of this work is to design and implement a solution approach which makes use of the lambda architecture as well as fog computing to process data streams in real-time.

The evaluation focuses on how well fog-based stream processing topologies perform compared to a traditional cloud approach. Common metrics in the field of data processing are used for a quantitative evaluation (latency, round-trip time of data packets). The evaluation of the solution approach shows that using distributed stream processing in the fog can be a very promising alternative compared to traditional data processing in the cloud. Overall, the results show a decrease in the round-trip times, especially if the latency to the cloud exceeds 50 ms or the data packet size is quite large.


Contents

Kurzfassung

Abstract

Contents

1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Aim of the Work
  1.3 Methodology and Approach
  1.4 Structure

2 Background
  2.1 Internet of Things
  2.2 Fog Computing
  2.3 Big Data Analytics
  2.4 Lambda Architecture

3 Related Work
  3.1 Distributed Stream Processing for the IoT and Fog Computing
  3.2 Lambda Architecture for Distributed Stream Processing
  3.3 Fog Computing Infrastructure
  3.4 Conclusion

4 Requirements Analysis and Design
  4.1 Requirements
  4.2 Architecture

5 Implementation
  5.1 Infrastructure Setup
  5.2 Development Operations
  5.3 Implementation of the Lambda Architecture
  5.4 Implementation of Non-Functional Requirements
  5.5 Limitations

6 Evaluation
  6.1 Data Sets
  6.2 Motivational Scenario
  6.3 Testbed
  6.4 Topology
  6.5 Benchmarks
  6.6 Summary

7 Conclusion and Future Work
  7.1 Discussion
  7.2 Future Work

List of Figures

List of Tables

Acronyms

Bibliography

CHAPTER 1

Introduction

1.1 Motivation and Problem Statement

With the growth of big data, there is a rising demand for analyzing and making use of those large piles of data [1]. To do that, programming models, frameworks, platforms, and tools have been established. Most of those distributed big data platforms are designed to run on commodity clusters. Apache Hadoop in combination with the MapReduce programming model is the most prominent example for such a platform [2, 3]. Such systems have been designed for batch processing and are therefore not suitable for real-time stream processing. With application scenarios like smart cities and autonomous driving emerging, there is a growing need to process continuous streams of data close to real-time [4]. In other words, big data processing systems are evolving to be more stream-oriented, where each data record is processed as it arrives on a continuous basis [5]. Consequently, distributed stream processing frameworks like Apache Storm [6], Apache Flink [7], Apache Heron [8], Apache Spark Streaming [9] and others have emerged.

The aforementioned frameworks are designed for homogeneous nodes within a local cluster with low latency (cloud computing) [10]. For Internet of Things (IoT) applications, this centralized approach often leads to high latency, since data streams (e.g., sensor data) must be sent to the cloud first in order to process the data. To address this issue and to efficiently process IoT data on a large scale, the cloud alone is no longer sufficient [11]. There is an increasing trend to push the processing of data closer to the edge of the network, where the data is generated and stored [12]. According to recent research, it can be beneficial to deploy stream processing operators in the fog [13] and to use those computational resources at the edge of the network to address high latency to the cloud [14].

In order to take advantage of both batch and stream processing, the lambda architecture design pattern has been introduced [15, 16]. This architectural style is based on three layers which allow massive volumes of historical data to be processed efficiently (batch and serving layer), while simultaneously using stream processing to provide a real-time analysis of continuous data streams (speed layer). As each layer consists of its own data store and stream processing operators are deployed in the fog, it is desirable to deploy the data store of the speed layer on the edge of the network to avoid high latencies to the cloud. Moreover, the batch and serving layer of the lambda architecture make it possible to reduce the response time for frequently used long-running queries, as the results of these queries are stored in precomputed batch views.

As of today, there exists no solution approach which addresses the requirements of the lambda architecture design pattern for distributed stream processing across fog computing infrastructure.

1.2 Aim of the Work

The aim of this thesis is to establish a distributed stream processing system for fog computing which utilizes the lambda architecture design pattern. The following paragraphs define the expected outcome and the objectives of this work.

Lambda Architecture

The problem statement already discussed that current state-of-the-art stream processing solutions are not suitable for processing data in real-time in the context of the IoT while also efficiently utilizing historical data. In the context of data analytics, fog computing and the lambda architecture follow similar goals. Among others, the lambda architecture tries to optimize the latency by using precomputed batches and a speed layer to store the most recent data. Although this is already a significant improvement compared to fully incremental solutions (e.g., where an application directly reads from and writes to one database [15]), the data is still located in the cloud. Querying data in the cloud is often not suitable for latency-sensitive applications [17]. Fog computing tries to compensate for those use cases by extending the cloud and providing additional resources at the edge of the network [17]. Although it might be necessary to store historical data in the cloud (batch and serving layer), a central data store for the speed layer should not be used. The data of the speed layer should be stored at the edge of the network to keep the network overhead and the latencies at a minimum. This work's primary goal is to identify how the lambda architecture can be utilized in a fog computing environment. Furthermore, this work aims to determine how the lambda architecture can be implemented with respect to distributed stream processing in the fog and how batch and real-time views can be synchronized.

Distributed Stream Processing in the Fog

With the increase of data velocity and data volume, centralized solutions are no longer sufficient for latency-sensitive IoT applications. Although it can be very beneficial to use fog computing for distributed stream processing, such frameworks are not optimized for the fog [10]. Distributed stream processing frameworks are designed for homogeneous nodes within a local cluster with low latency. Extending the cloud with computational resources at the edge of the network is often not suitable without addressing the requirements of the fog. This work aims to identify current research approaches and determine whether established stream processing frameworks can be used across fog computing infrastructure and how they can be optimized for the heterogeneous resources of the fog. Notably, this work focuses on implementing the lambda architecture over fog computing infrastructure rather than implementing the next stream processing framework for the fog.

Solution Approach for Fog Computing

Within the scope of this work, a solution approach is designed and implemented. The solution approach's main goal is to process data close to the data sources to avoid high latency and bandwidth usage to the cloud. Instead of sending all the data to the cloud, processing it there, and sending the results back, the idea is to process data streams over fog computing infrastructure in a distributed way over several fog nodes. The solution approach should use the lambda architecture design pattern to process massive volumes of historical data while simultaneously using stream processing to provide real-time analysis of continuous data streams. Furthermore, the implemented solution approach aims to address the heterogeneous and dynamic resources of the fog. Among others, the following questions are addressed by the solution approach regarding fog computing infrastructure:

• How can nodes in the fog be orchestrated?

• How can services be deployed on fog computing infrastructure?

• How many computational resources does a node in the fog need?

• How can applications on fog computing infrastructure be tested?

• How can fog computing infrastructure be simulated or emulated?

1.3 Methodology and Approach

The methodology of this work follows a design science and system research approach [18, 19, 20]. For this purpose, existing artifacts in the context of the problem statement are analyzed. Based on the findings and the problem statement, solution approaches and requirements are defined to improve the artifacts. The approach applied within this thesis can be grouped as follows: At first, existing artifacts are analyzed. Next, a prototype utilizing the improved artifacts is constructed. Finally, the prototype is evaluated against the defined requirements.

Analysis of existing artifacts

As of today, the fields of big data and stream processing as well as the area of fog computing are under extensive research. Also, stream processing at the edge of the network is not an entirely new idea. However, there is – to the best of our knowledge – no active research regarding the lambda architecture design pattern across fog computing infrastructures.

In a first step, problems and challenges of current research are identified and adequately addressed with respect to the problem statement. This identification process focuses on the three primary artifacts of this work: the lambda architecture design pattern, distributed stream processing, and fog computing. For example, it needs to be evaluated if existing distributed stream processing frameworks, like Apache Storm, are suitable to solve the problem statement. As it is not feasible to evaluate all existing frameworks, the most promising frameworks and research approaches are analyzed in detail to identify which components are suitable to solve the problem statement. On a very high level, the distributed stream processing system must be suitable for fog computing and must support the requirements of the lambda architecture design pattern. Furthermore, possible storage solutions for the lambda architecture (and especially for the speed layer) need to be identified.

Based on the findings, functional as well as non-functional requirements for implementing the lambda architecture for distributed stream processing in the fog are defined.

Design

After analyzing existing artifacts and defining the requirements to address the problem statement, a prototype is constructed according to these requirements. As this work focuses on the implementation of the lambda architecture over fog computing infrastructure, rather than implementing the next stream processing framework for the fog, the construction process focuses on using already established stream processing frameworks and on how they can be optimized for the fog. The same applies to other artifacts such as data stores, services, tools, and technologies in general.

Evaluation

Last but not least, the solution approach is evaluated with respect to the requirements as well as non-functional aspects. The evaluation uses common metrics in the field of big data and distributed stream processing [10, 21]. For example, for a quantitative evaluation, latency, throughput, scalability and resource utilization could be used as key indicators.

In this work, the evaluation focuses on distributed stream processing in the fog and on how well fog-based stream processing topologies perform compared to a traditional cloud approach. To determine if the solution approach addresses the problem statement, the evaluation measures the round-trip time of data packets.

In order to run the experiments, a testbed is established to emulate a realistic fog computing environment on virtualized infrastructure. The testbed has been bootstrapped with MockFog [22]. Throughout the evaluation of the system, the best practices for performance analysis as described in the book “The Art of Computer Systems Performance Analysis” by Raj Jain are followed [23].


1.4 Structure

After this brief introduction, the remainder of this work is structured as follows:

• Chapter 2 provides an overview of relevant background information and important concepts that are used within this thesis. This includes a description of the IoT, fog computing, big data analytics with a focus on open-source distributed stream processing frameworks, and the lambda architecture design pattern.

• Chapter 3 discusses state-of-the-art research and related work regarding distributed stream processing for the IoT and fog computing as well as the lambda architecture. Furthermore, this chapter addresses the current research for evaluating and testing IoT-based applications in the fog.

• Chapter 4 describes the required functionality to address the aforementioned problem statement and subsequently presents the architecture of the solution approach.

• Chapter 5 discusses how the proposed architecture can be implemented from a more technical point of view, including the hardware and software configuration of fog and cloud infrastructure, how fog nodes can be efficiently utilized and orchestrated, and the presentation of the implementation of the lambda architecture as well as the implementation of non-functional requirements.

• Chapter 6 describes the evaluation of the solution approach, starting with the discussion of possible data sets for the evaluation, followed by the setup of the testbed, the design of the stream processing topology, and the discussion of the results.

• Chapter 7 outlines essential findings of implementing the lambda architecture for distributed stream processing in the fog. Last but not least, possible future work is discussed.


CHAPTER 2

Background

This chapter gives an overview of relevant information and important concepts that are used within this thesis. First, an introduction to the Internet of Things (IoT) is given. Second, fog computing is described. A detailed description of big data analytics follows, with a focus on open-source distributed stream processing frameworks. Last but not least, the lambda architecture design pattern is discussed.

2.1 Internet of Things

2.1.1 Overview

The term Internet of Things has its origin in using Radio-Frequency Identification (RFID) in the field of supply chain management and has most likely first been mentioned by Kevin Ashton in a presentation at Procter & Gamble in 1999 [24]. The concept of connected smart devices had already been discussed in 1982 at the Carnegie Mellon University (CMU), connecting a Coca-Cola vending machine to the Internet. In 2005, the term IoT was formally introduced in the Internet Reports of the International Telecommunication Union (ITU) [24, 25]. Since then, and especially in recent years, the IoT has gained a lot of attention in academic research and the industry. The IoT can be seen as a connection of four main components: things, data, people and processes [24, 26]. The following list gives a short description of these components:

• Things refer to physical devices and objects that are connected to a network or the Internet. IoT devices must be able to communicate with other devices or the Internet, to send some sort of data (e.g., sensor data) and to perform certain actions.

• Data refers to the data that is produced by IoT devices and used in smart decision-making processes.


• People are the creators and consumers of IoT services. The IoT connects people in more relevant ways and helps them to be more informed and make better decisions.

• Processes make it possible to make efficient decisions with the right information at the right place at the right time (e.g., intelligent automation, informed decision-making).

Today, there exist many definitions of the term IoT, as a lot of companies, like IBM, SAP, Gartner, and Cisco, came up with their own interpretation of what the IoT means [26]. However, a widely-used comprehensive definition of the IoT from the ITU reads as follows: “A global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies” [27].

2.1.2 IoT Elements and Enabling Technologies

Looking at state-of-the-art research, it can be observed that there is a scientific consensus on what the IoT landscape looks like [28, 29, 30, 31]. For example, Al-Fuqaha et al. categorize the IoT landscape into six building blocks: identification, sensing, communication, computation, services and semantics [30]. Gubbi et al. break it down into three categories, including the actual hardware (sensors, communication), middleware (storage and data analytics) and presentation (visualization and interpretation of data) [31]. Although the authors defined a slightly different naming scheme, it boils down to a very similar definition of IoT elements and enabling technologies. In the following sections, the most important IoT elements together with their enabling technologies are discussed.

Identification

Object identification is a very important task in the IoT [32]. Objects refer to real-world things like products or devices. Today, there exist many methods to automatically identify objects and, as a result, the data that is produced by those objects. An example we encounter every day is the Electronic Product Code (EPC), which is used to identify goods in supply and demand chains. Other examples are Auto-ID technologies like RFID tags, which can be applied to real-world objects of all kinds. In addition to that, IoT objects like sensing devices need a unique address within a communication network [30]. Such identification depends on the used infrastructure protocol. Examples for such protocols are IPv6 over Low-Power Wireless Personal Area Networks (6LoWPAN), ZigBee or Bluetooth Low Energy. One of the most frequently used protocols is 6LoWPAN, which uses IPv6 addresses for object identification within a network [32].

Sensing

IoT devices produce data. IoT sensing describes the process of gathering sensor data (e.g., data from temperature or humidity sensors) for storage and data analysis [30]. Depending on the capabilities of the sensor devices and the general architecture, data can be either directly used on the IoT device itself or collected, stored, and processed at a centralized infrastructure [31, 33]. This is often achieved by sending the sensor data to data warehouses. Such warehouses are usually located in the cloud, where cheap data storage is available.

Computation

Computation refers to the actual hardware and software that is used to run IoT devices. There exist many different microcontrollers and microprocessors (e.g., Raspberry Pi, Arduino) and lightweight operating systems (e.g., TinyOS) which are explicitly designed for IoT applications [32]. As has already been described in the previous section, the cloud provides additional resources and storage to process vast amounts of IoT data.

Communication

The use of efficient communication technologies is very important, as IoT devices are often not very powerful and resources are limited. Furthermore, the communication can be very lossy and noisy, as IoT devices are usually located on the edge of the network or move within the network [32]. The IEEE 802.15.4 technical standard defines the operation of Low-Rate Wireless Personal Area Networks (LR-WPANs), which forms the basis for network infrastructure protocols that focus on low-power communication between IoT devices, such as 6LoWPAN or ZigBee [30, 32, 34]. Besides the 802.15.4 technical standard, other low-level (data link) protocols for the IoT include Bluetooth Low Energy, Long Range Wide Area Network (LoRaWan) and WiFi [34]. Among others, widely-used application protocols for IoT applications are: Constrained Application Protocol (CoAP), Message Queue Telemetry Transport (MQTT), Advanced Message Queuing Protocol (AMQP) or HTTP Representational State Transfer (REST). Figure 2.1 gives an overview of prominent communication protocols for the IoT. Efforts for standardization are led by various organizations like the World Wide Web Consortium (W3C), the Internet Engineering Task Force (IETF), EPCglobal, the Institute of Electrical and Electronics Engineers (IEEE) and the European Telecommunications Standards Institute (ETSI).

Figure 2.1: Prominent protocols for the IoT (adapted from [32])
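To make the role of such application protocols more concrete, the following sketch publishes a single sensor reading over MQTT. It uses the Eclipse Paho Java client as one common client library; the broker address, client id, topic name and payload are illustrative assumptions and not taken from this thesis.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class TemperaturePublisher {
    public static void main(String[] args) throws MqttException {
        // Connect to an (assumed) MQTT broker running at the edge of the network.
        MqttClient client = new MqttClient("tcp://192.168.0.10:1883", "sensor-42");
        client.connect();

        // Publish one temperature reading to a topic; subscribers (e.g., a stream
        // processing topology) receive it without knowing the publisher.
        MqttMessage message = new MqttMessage("21.5".getBytes());
        message.setQos(1); // at-least-once delivery
        client.publish("sensors/temperature", message);

        client.disconnect();
    }
}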

Services

Current research categorizes IoT services into four classes [30, 32, 35]:

• Identity-related Services provide information about the device (e.g., RFID is a prominent technology used in identity-related services).

• Information Aggregation Services collect and process data from various sensors and perform actions based on the processed data.

• Collaborative-Aware Services are used to make decisions based on aggregated data.


• Ubiquitous Services provide collaborative-aware services for everyone, everything and at any time.

Knowledge extraction

In order to take actions based on the gathered data, the data needs to be analyzed to extract knowledge and take certain actions. To do so, data mining and machine learning tools can be used to convert information into knowledge [31]. Furthermore, semantic web technologies like the Resource Description Framework (RDF) and the Web Ontology Language (OWL) could be used in the decision-making process [32].

2.1.3 Fields of Application

Since the introduction of the IoT in 1999, the IoT landscape has developed rapidly. Today, there exist a lot of application scenarios in very different areas where the IoT is used or can be used in the future [30, 29, 31]. Firouzi et al. [26] mention several vertical markets in which the IoT plays a huge role. In the following paragraphs, we take a quick look at how the IoT has influenced some of those markets.

Smart Manufacturing

The IoT as well as other emerging technologies like cloud computing, big data analytics and artificial intelligence are key enabling technologies for smart manufacturing [36]. In particular, IoT devices (e.g., sensors) make it possible to increase the degree of automation, allow for Machine-to-Machine (M2M) communication and self-monitoring, and complete manufacturing tasks without or with minimal human interaction [29]. The ongoing transformation of traditional manufacturing to smart automated manufacturing is also known as the Fourth Industrial Revolution (Industry 4.0). The use of these technologies addresses current challenges such as optimizing productivity, higher quality and quality control, or producing individualized products efficiently [36].

Transportation

Another very promising area for IoT applications is the field of transportation, which can range from assisted/automated driving to smart streets and smart traffic signals up to collision avoidance and monitoring of hazardous materials on the road [29]. Cars and infrastructure equipped with sensors and technologies like Vehicle-to-Infrastructure (V2I), Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure-to-Vehicle (V2I2V) make it possible to build intelligent transportation systems to improve safety and navigation and to save valuable resources [37, 38].

Logistics

In Section 2.1.1, the usage of RFID in supply and demand chain management has been discussed briefly. RFID allows for real-time monitoring and control of every link in a supply chain, e.g., material acquisition, manufacturing, storage, distribution, warranty processing or after-sale services [29].

The chosen examples above illustrate how the IoT can be used to improve such applications. There exist many more promising areas and application scenarios where IoT technologies can be used, such as Smart City, Smart Home, Smart Agriculture, Smart Grid, Smart Building or even healthcare [39, 40, 41, 42, 43, 44, 45].

2.2 Fog Computing

2.2.1 Overview

To give a profound definition of fog computing, we first need to take a step back and define some characteristics of the cloud computing model. The National Institute of Standards and Technology (NIST) defines cloud computing as follows: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [46]. Cloud computing follows a pay-as-you-go business model and is an efficient alternative to operating private data centers [17], as cloud computing offers a rich set of features such as on-demand self-service, broad network access, resource pooling, metering capability and rapid elasticity [46]. Economy of scale effects led to huge data centers at rather inexpensive locations with homogeneous compute, storage, and networking components [17]. Due to this development, cloud computing is often not suitable for specific application requirements like latency-sensitive applications, geographically distributed applications or applications that are not fixed at one place [17].

Fog computing tries to compensate for those use cases by extending the cloud and providing additional resources at the edge of the network [17]. Additional resources refer to a variety of heterogeneous devices with computation, storage and networking capabilities. Figure 2.2 shows a possible fog computing environment, where the fog provides additional resources for IoT devices at the edge of the network [47].

Figure 2.2: Fog computing environment (adapted from [47])

2.2.2 Characteristics

The characteristics of cloud computing have already been discussed very briefly. Next, we want to define the specific characteristics which only apply to fog computing [24, 17]:

• Location and latency: Services and applications hosted in the fog are close to data sources and consumers. This allows low latency and location-aware operations.

• Geographical distribution: Fog computing is independent from data centers and allows to provide resources wherever they are needed (e.g., road side units, computational resources deployed across a road network [17]).


• Large number of nodes: Fog computing typically consists of many nodes with rather low processing capabilities compared to cloud computing (e.g., sensor devices deployed over a city).

• Mobility: Nodes in the fog can be static or move around (e.g., mobile phones, automated driving [37, 38]).

• Real-time interaction: Fog computing allows for real-time analysis of streaming data and is not very suitable for batch processing on large (historic) data piles.

• Heterogeneity: Nodes in the fog can be very diverse and are usually not homogeneous with equal computation, storage and networking capabilities.

• Predominance of wireless access: Due to the need for location-independent deployment of nodes, a wired gigabit Ethernet connection is often not available.

• Analytics and cloud interplay: Fog computing enables low-latency real-time analytics close to the data and can use additional resources from the cloud if needed.

2.2.3 Enabling Technologies and Challenges

Advances in network mobility, software virtualization, IoT and application technologies as well as the availability of powerful (mobile) hardware are key drivers for computing and offloading tasks into the fog [24, 48, 49, 50]. In the following paragraphs, the most important and influential enabling technologies and challenges for fog computing are discussed.

Wireless technologies

The IoT and fog computing share a lot of enabling technologies. Especially, the communication technologies mentioned in Section 2.1.2 apply to fog computing as well. Recent developments in wireless broadband communication (IEEE 802.11ac, LTE, 5G) make it possible to provide fast connectivity to mobile devices like phones or cars [49]. The IoT is constantly evolving, and new communication standards like the 5th generation network (5G) will significantly contribute to the development of the future IoT by enabling latency-sensitive applications on a massive scale while providing up to 10 gigabits per second bandwidth [51, 52]. Another – already mentioned – prominent wireless alternative is LoRaWan (Long Range Wide Area Network). LoRaWan has been designed as a long-range, low-power and low data rate communication protocol [53, 54]. Depending on the area, it covers a range of up to 15 km with a maximum bandwidth of 27 kb/s (50 kb/s with frequency shift keying) [54]. Due to the low power consumption, LoRaWan devices feature a battery lifetime of up to 10 years. A LoRaWan network is organized as a star topology. Several star topologies can be combined into a star-of-stars topology. Each star topology features a gateway to exchange messages [54].


Virtualization and containment

Virtualization techniques play an important role in reducing complexity in heterogeneous fog environments. Virtualization offers an abstraction layer of the underlying hardware and allows to logically distribute and manage resources [49]. Furthermore, it provides portability and allows to move Virtual Machines (VMs) from one place to another or to reuse already existing VMs. The same applies to containers. Compared to virtual machines, containers do not need to virtualize the whole operating system. Instead, containers use a host operating system and are therefore more lightweight [49]. Virtual machines as well as containers enable sandboxing and provide a secure, isolated execution environment on fog devices. This is especially important as fog devices can be deployed anywhere and are not necessarily protected from unauthorized access [24]. Providing security in the fog is still a big challenge and needs to be addressed adequately (e.g., mechanisms to detect tampering) [24].

Besides the virtualization of devices, there is also a need to deploy and maintain secure fog networks in a virtualized manner. Network Function Virtualisation (NFV) and Software Defined Networks (SDNs) make it possible to easily configure, update and manage networks [48]. SDNs make it possible to create virtual networks on top of existing physical infrastructure. NFV allows to dynamically deploy on-demand network services such as firewalls, routers or LANs. The “softwareisation” of networks allows for cheaper and more agile operations, as hardware-based infrastructure like routers is not needed [48].

Hardware-related enablers

Recent advances in the chip industry (e.g., System-on-a-Chip (SoC) designs) allow for smaller and more powerful processors and devices which consume less power and provide better battery life [48]. Smaller devices result in lower production costs and enable a broader deployment of IoT nodes. More powerful hardware and especially multi-core chips allow to run multiple jobs in parallel for work-intensive tasks [49].

Paradigms and architectures

IoT technologies and the orchestration of IoT nodes in the fog make it possible to efficiently make use of the existing infrastructure [24, 49, 55, 56]. Aazam et al. mention several IoT architecture types such as cloudlets, Mobile Edge Computing (MEC), micro/nano data centers, or femto clouds [49]. While not all of them refer to the traditional fog computing paradigm, all operate in the fog and help to reduce energy consumption and latency [49]. Cloudlets, MEC, micro and nano data centers follow the idea of providing additional resources at the edge of the network to enable low latencies for IoT applications. For example, cloudlets can be described as small, mobile data centers at the edge of the network. Cloudlets provide large computational resources, a high-speed Internet connection to services deployed in the cloud and low latency to mobile and IoT devices. Femto clouds follow a different approach and do not provide additional resources at the edge of the network. Femto clouds try to orchestrate existing co-located mobile and IoT devices and use their resources to perform certain tasks [49, 57].



Further challenges

Fog computing is still at its very beginning and there are a number of challenges that need to be addressed. Aazam et al. mention several challenges for offloading tasks from the cloud to the fog to meet common application requirements such as reliability and high end-to-end performance [49]. Besides the aforementioned security issues, resource allocation/limitations, scalability, heterogeneity issues, fault tolerance, energy consumption, monitoring, standardization and monetization need to be addressed adequately to provide a cloud-like experience [47, 48, 49].

2.2.4 Fields of Application

There exists a wide range of possible application scenarios that would benefit from additional computational resources at the edge of the network. For example, smart manufacturing and the Industry 4.0 movement use IoT devices (sensor networks, actuators, robots, machines) to increase the degree of automation. Due to security and latency requirements, most of the tasks need to be performed locally within a company [58]. Fog computing can help to address these requirements by providing additional computational resources at the edge or orchestrating existing resources for security- or latency-sensitive tasks.

Another very promising use case for fog computing is data analysis. Bo et al. introduce a fog computing architecture for big data analysis in smart cities [59]. The authors argue that location awareness and low latencies provided by the fog are essential to build the smart cities of the future. Yi et al. argue that the combination of cloud and fog computing can help to build a smart, performant large-scale environment monitoring system [60]. Such a system can be used for emergency cases like toxic pollution alerts. While fog nodes are used to aggregate and analyze local data to provide fast feedback, the cloud can be used for resource-intensive tasks like a further detailed analysis. Furthermore, Dastjerdi et al. mention healthcare, activity tracking, smart utility services, augmented reality, cognitive systems, and gaming as possible application scenarios for the fog [47].

2.3 Big Data Analytics

Big data analytics can be categorized into two paradigms: stream processing and batch processing [61, 62]. This categorization is based on how frequently data is generated and on how much time it takes to analyze the data once it has been generated (data velocity). Depending on the use case, both paradigms are essential in big data analytics. The value of data for stream processing depends on how recent the data is (data freshness), and therefore data needs to be analyzed as soon as possible once it is generated (e.g., sensor data which is only valid for a few milliseconds until the sensor provides new data and the old data is obsolete) [62]. The second paradigm is used to analyze big piles of data, where data freshness does not matter. For batch processing, data is first stored properly and analyzed afterwards, often together with other existing data [62]. A typical use case for batch processing is to analyze historic data to gain insights into a domain or to predict how the data will evolve in the future.

Depending on the data velocity, there exist several frameworks which can be used for stream or batch processing or even both [63]. The MapReduce programming model in combination with Apache Hadoop is a prominent example for batch processing. There, input data is stored in a distributed file system, e.g., the Hadoop Distributed File System (HDFS). Although this technology has existed for quite some time now, it is still used heavily. One of its biggest competitors is the Apache Spark [9] computational framework. Depending on the use case, it outperforms the Hadoop MapReduce framework [64]. With the emergence of Internet-scale applications in recent years, new distributed stream processing frameworks have been developed, as traditional centralized approaches were not able to handle the ever-increasing amount of data in real-time [65]. Examples for current open-source stream processing solutions are Apache Storm [6], Apache Heron [8], Apache S4 (http://incubator.apache.org/projects/s4.html), Apache Samza (http://samza.apache.org), Apache Flink [7], Apache Spark Streaming [9] and others. Besides traditional frameworks, there exist commercial and public cloud solutions like Amazon Kinesis Streams [66], Google Dataflow (https://cloud.google.com/dataflow) and Azure Stream Analytics (https://azure.microsoft.com/en-us/services/stream-analytics).

Furthermore, Kale gives a detailed overview of the first academic and research prototype systems for (distributed) data stream processing [67]. Such systems include TelegraphCQ [68], STREAM [69], Aurora [70], Borealis [71], StreamIt [72], IBM System S and IBM SPADE (the System S Declarative Stream Processing Engine) [73]. Some of those research projects evolved into commercial systems, such as IBM InfoSphere Streams [74], which is based on the System S research prototype. Although current state-of-the-art distributed stream processing frameworks are highly influenced by the aforementioned research projects, we will not focus on those projects in more detail. Besides that, we will not discuss batch processing frameworks, as this work focuses on distributed stream processing in the fog.

2.3.1 Distributed Stream Processing Frameworks

In the following paragraphs, the aforementioned open-source distributed stream processing frameworks are described in further detail, with the exception of Apache S4, which is no longer being actively developed.

Apache Storm

Apache Storm is a real-time distributed processing system [75]. It is scalable and fault-tolerant, makes sure that each data tuple is processed at least once, and typically runs on commodity clusters with many nodes. Applications in the Apache Storm world are called topologies. A topology defines how the data is processed and how the data moves between nodes. It consists of one or more data sources (data spouts) and operators (bolts), which perform arbitrary functions on tuples (see Figure 2.3). Typical operations include filtering, processing or storing [13]. Topologies operate on an endless stream of data tuples and run forever until they are shut down.

Figure 2.3: Apache Storm topology (adapted from [6])
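As a rough illustration of these concepts, the following self-contained Java sketch wires a spout and a bolt into a topology. It assumes the Apache Storm 2.x Java API, emits simulated temperature readings, and runs in a local cluster for demonstration; it is not the topology implemented in this thesis.

import java.util.Map;
import java.util.Random;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SensorTopology {

    // Spout: emits an endless stream of (simulated) temperature readings.
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(15.0 + random.nextDouble() * 10.0));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("temperature"));
        }
    }

    // Bolt: applies an arbitrary function to every tuple (here: a simple filter).
    public static class AlertBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            double temperature = input.getDoubleByField("temperature");
            if (temperature > 23.0) {
                System.out.println("High temperature: " + temperature);
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, no outgoing stream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor-spout", new SensorSpout(), 1);
        builder.setBolt("alert-bolt", new AlertBolt(), 2).shuffleGrouping("sensor-spout");

        // Run locally for ten seconds; on a real cluster, StormSubmitter.submitTopology(...)
        // would hand the topology over to the Nimbus master node instead.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("sensor-topology", new Config(), builder.createTopology());
            Utils.sleep(10_000);
        }
    }
}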

An Apache Storm cluster consists of two different types of nodes: a master node (which runs a daemon called Nimbus) and one or more worker nodes (each node runs a Supervisor daemon) [6]. The master node assigns tasks to the worker nodes and monitors the whole cluster. Worker nodes can spawn one or more worker processes and execute the tasks assigned by the master node. For communication between nodes, Apache ZooKeeper (https://zookeeper.apache.org) is used. Figure 2.4 shows the main components of a Storm cluster. Furthermore, Apache ZooKeeper is used to keep track of the current state within a topology. The Nimbus node as well as the Supervisor nodes are stateless, which results in a more stable cluster, as the master and worker nodes can be replaced easily.

Figure 2.4: Apache Storm cluster (adapted from [75])

Apache Heron

Apache Heron is a real-time, distributed, fault-tolerant stream processing engine created by Twitter. Heron was built to address issues related to scalability, debug-ability, manageability and cluster resource allocation of Apache Storm [8]. In order to improve debugging, profiling, and troubleshooting, topologies are process-based – with each process running in isolation – and not thread-based like in Storm [76]. Furthermore, Heron has a built-in back pressure mechanism to regulate the imbalance between the incoming data rate and the processing rate. An empirical evaluation has shown that with this new architecture, Twitter was able to reduce CPU usage, improve throughput and reduce tuple latencies [8]. Heron replaced Storm as the stream data processing engine inside Twitter and is currently in incubation at the Apache Software Foundation (http://incubator.apache.org/projects/heron.html). It is backward compatible with topologies of Apache Storm and advertised by Twitter as an improved implementation of Storm [8]. Figure 2.5 shows the topology architecture of Heron. The architecture is based on containers, where each container runs a Stream Manager, a Metrics Manager and several Heron Instances. Heron Instances are the counterparts to spouts and bolts in Apache Storm.

Figure 2.5: Apache Heron topology architecture (adapted from [8])

Apache Spark

Apache Spark advertises itself as a unified analytics engine for large-scale data processing, which provides – due to in-memory persistence capabilities – much faster execution times than MapReduce [77, 78]. The Apache Spark stack consists of several components like Spark Core, Spark Streaming, Spark SQL, MLlib for machine learning and GraphX for graph computation. Besides traditional batch processing, Spark supports scalable, high-throughput, fault-tolerant distributed stream processing [77]. Since Spark 2.x, Spark also supports Structured Streaming, a new scalable and fault-tolerant stream processing engine built on the Spark SQL library. Unlike Storm or Heron, Spark Streaming uses micro-batches for stream processing (see Figure 2.6). Instead of processing each data record separately, micro-batching accumulates multiple records of data (e.g., over the last few seconds) before processing them. Although the usage of micro-batches increases the throughput, micro-batching does not allow true real-time processing and is therefore not suitable for low latency requirements.

Figure 2.6: Apache Spark Streaming (adapted from [9])


Spark Streaming uses a discretized stream (DStream), which represents a stream of Resilient Distributed Datasets (RDDs). An RDD represents an immutable distributed collection of objects, split into multiple partitions to provide fault tolerance. Structured Streaming, on the other hand, uses Dataframes and Datasets to process incoming data. Dataframes are structured RDDs and allow applying SQL queries on streaming data. Datasets are an extension of Dataframes that provides a type-safe, object-oriented programming interface. Spark 2.3 introduced a new processing mode called Continuous Processing for low-latency stream processing.
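The following Java sketch illustrates the micro-batching model of the DStream API; it is not part of the thesis implementation. The socket source, the 2-second batch interval, and the local master URL are illustrative assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("micro-batch-example").setMaster("local[2]");

        // Incoming records are grouped into micro-batches of 2 seconds each.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // A DStream is a sequence of RDDs, one RDD per micro-batch.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        JavaDStream<Long> countsPerBatch = lines.count();
        countsPerBatch.print(); // number of records in each 2-second batch

        ssc.start();
        ssc.awaitTermination();
    }
}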

Apache Flink

Like Apache Spark, Apache Flink is a distributed processing framework for stream processing as well as batch processing. It is open-source, scalable, fault-tolerant, supports a high throughput as well as low latencies and can be easily integrated into the Apache Hadoop ecosystem. Unlike Spark Streaming, it is a true streaming framework and does not use micro-batches to process streaming data. Compared to Apache Storm, Flink provides a more high-level API, which offers default functions like map, reduce, and aggregate out of the box. Furthermore, it supports stateful stream applications and guarantees that data is processed exactly once, while Storm guarantees at-least-once processing.
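As a rough sketch of Flink's high-level DataStream API (again not the thesis topology), the following Java program keeps a running word count per key on a record-by-record basis; the socket source and the word-count logic are illustrative assumptions.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Records are processed one by one as they arrive (no micro-batches).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda
                .keyBy(tuple -> tuple.f0) // stateful grouping per word
                .sum(1);                  // running count maintained as keyed state

        counts.print();
        env.execute("flink-word-count"); // exactly-once state guarantees require checkpointing to be enabled
    }
}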

Apache Kafka

Apache Kafka is a distributed streaming platform and not necessarily a stream processing framework. Kafka is often used as a message broker to transfer data between applications, nodes or systems. It uses topics to categorize streams of data records and a publish-subscribe paradigm on topics. Kafka is highly scalable and designed to handle arbitrary amounts of data, as topics consist of partitions which are distributed across the cluster. On top of that, Apache Kafka offers a library for processing and analyzing data stored in Kafka, called Kafka Streams. Kafka Streams is highly scalable, fault-tolerant, guarantees exactly-once processing and does not require a cluster for data processing.
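To make the publish-subscribe model more concrete, the following Java sketch publishes a single record to a Kafka topic; the broker address, topic name, key and value are illustrative assumptions and not taken from the thesis setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Records published to the "sensor-data" topic are spread over its partitions
        // (here by key) and can be consumed by any number of independent subscribers.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sensor-data", "sensor-42", "21.5"));
        }
    }
}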

Apache Samza

Apache Samza is a real-time distributed stream processing framework. Samza relies on Apache Kafka for streaming and on Apache YARN (Yet Another Resource Negotiator) as a cluster resource manager. Besides resource management, YARN provides fault tolerance, process isolation and the security model of Apache Hadoop. Samza supports stateful stream processing, offers at-least-once processing, high scalability, low latency and high throughput. Samza is tightly coupled with Apache Kafka but also supports other input sources like HDFS.

2.4 Lambda Architecture

The lambda architecture design pattern has first been introduced by Nathan Marz (the creator of Apache Storm) [15]. The main idea of the lambda architecture is to build big data systems as a stack of layers [4]. Typically, the lambda architecture consists of three layers: speed layer, serving layer and batch layer. The interplay of those three layers makes it possible to process massive volumes of historical batch data, while simultaneously using stream processing to have a real-time analysis of continuous data streams [79]. The architecture allows optimizing the latency, throughput, and fault tolerance of long-running queries by providing accurate data from the batch layer and recent data from the speed layer. Nathan Marz describes the lambda architecture with three equations [15]. Batch views are computed over the master dataset (see Equation 2.1), real-time views are created by adding new incoming data to the real-time views (see Equation 2.2) and queries are resolved by combining a batch view and a real-time view (see Equation 2.3). Figure 2.7 gives an overview of the interplay between the different layers. In the following sections, each layer is described in more detail.


Figure 2.7: Lambda architecture (adapted from [15])

batch view = function(all data)    (2.1)

real-time view = function(real-time view, new data)    (2.2)

query = function(batch view, real-time view)    (2.3)
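The following Java sketch illustrates how these three equations interact. It is a deliberately simplified illustration (counting events per sensor in plain in-memory maps) and not the data stores or query logic used in this thesis.

import java.util.HashMap;
import java.util.Map;

public class LambdaQuery {
    // Assumed example: both views map a sensor id to an event count.
    private final Map<String, Long> batchView = new HashMap<>();    // recomputed from all data (Equation 2.1)
    private final Map<String, Long> realtimeView = new HashMap<>(); // incrementally updated

    // Speed layer: incremental update as new data arrives (Equation 2.2).
    public void onNewEvent(String sensorId) {
        realtimeView.merge(sensorId, 1L, Long::sum);
    }

    // Serving layer swap: a freshly computed batch view replaces the old one,
    // and the now-absorbed real-time data is discarded.
    public void swapBatchView(Map<String, Long> newBatchView) {
        batchView.clear();
        batchView.putAll(newBatchView);
        realtimeView.clear();
    }

    // Query: combine both views to answer with complete and recent data (Equation 2.3).
    public long query(String sensorId) {
        return batchView.getOrDefault(sensorId, 0L) + realtimeView.getOrDefault(sensorId, 0L);
    }
}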

2.4.1 Batch layer

The batch layer is a large data store which holds all the data. The idea behind this layer is to precompute batch views on data which is used often. This has the benefit that it is not necessary to process huge amounts of data every time (predefined) queries should be resolved. Instead of computing the results of a query on-the-fly, the results are read from a precomputed view. In other words, the batch view allows to get the query data very quickly, without processing the whole data from the ground up. In most cases, calculating those batch views is a high-latency operation which takes a long time to finish. Furthermore, batch views need to be recomputed continuously if new data is available. As a result, newly arriving data is not included in batch views from the very beginning and must be provided in a separate layer (see Section 2.4.3). Precomputed views should allow random reads and must therefore be indexed. A typical example for a batch processing solution is Apache Hadoop.

2.4.2 Serving layer

In order to make batch views accessible, a further layer is used, called the serving layer. As soon as a new batch is available, the serving layer replaces the old batch with the new one. The serving layer consists of a data store which allows swapping batches and random reads on those batches. Random writes are not required, as the data store only replaces full batches from the batch layer.

2.4.3 Speed layer

The purpose of the speed layer is to provide the most recent data, which is not yet represented in the batch views. This layer holds data which arrives during the (re)computation of batch views and compensates for the high latency of updates to the serving layer. The data store is continuously updated with new data. After the batch layer has recomputed the batch views and the serving layer has replaced the old batches with the new ones, the speed layer discards its data. Compared to the batch layer, the speed layer uses incremental algorithms to update real-time views instead of recomputing the views from the ground up. Therefore, data stores for the speed layer must support random reads as well as random writes.

2.4.4 Summary

Compared to fully incremental solutions (e.g., where an application directly reads from and writes to one database), the lambda architecture provides better accuracy, latency, and throughput [15]. Additionally, the lambda architecture reduces complexity and improves fault tolerance, scalability, maintainability, extensibility and debuggability [15].

In the context of data analytics, fog computing as well as the lambda architecture follow similar goals. Among others, the lambda architecture tries to optimize the latency by using precomputed batches and a speed layer to store the most recent data. Although this is already a big improvement compared to fully incremental solutions, the data is still located in the cloud. Querying data in the cloud is often not suitable for latency-sensitive applications [17]. Fog computing tries to compensate for those use cases by extending the cloud and providing additional resources at the edge of the network [17].


As already stated in the introduction, the lambda architecture is typically implemented in the cloud. However, the lambda architecture is very versatile and does not depend on any particular technology or tools. Therefore, the usage of the lambda architecture in combination with fog computing makes a lot of sense. For example, the different layers of the lambda architecture could be implemented across fog computing infrastructure to further optimize the latency. Especially the speed layer could be implemented in the fog, as the speed layer only stores the most recent data. Moreover, the speed layer is periodically cleared as soon as new data is included in the batch views and the views are available in the serving layer. As a result, the speed layer does not require a lot of storage space compared to the batch and serving layer.


CHAPTER 3

Related Work

In the following chapter, state-of-the-art research and related work in distributed stream processing are discussed. First, current research about distributed stream processing for the IoT and fog computing is addressed. Second, related work with respect to the lambda architecture design pattern is discussed. Third, related work regarding the evaluation and testing of IoT-based applications in the fog is discussed. The chapter concludes with a short summary of the findings from the previous sections.

3.1 Distributed Stream Processing for the IoT and Fog Computing

Section1.1 alreadymentions that acentralizedapproachlikecloud computing is not very suitable foremerging IoTapplications like smart cities, smart manufacturingor self-drivingcars [11]. With the increase of datavelocity and datavolume, thereis the needfor anew computingparadigm to addressissues like high latency, limited bandwidth,location awarenessorprivacy concerns [80]. Placing dataanalytics at the edge of thenetwork where the data is generated and stored could be asolution for addressing aforementioned issues[12]. Furthermore, Section1.1 states thatdistributed stream processing frameworkshavebeendesignedfor cloud computingenvironments and are well established forhomogeneous nodeswithin alocal cluster [76]. However, existing platformsfallshort in providingefficientsolutions for bigdataanalytics in the fog while the demandfor processing large quantities of datainreal-timeisincreasing[12]. Section 2.3 discusses open-source state-of-the-artdistributed stream processingframeworksand big data analyticsingeneral. In this section,related work and recent researchon distributed stream processing approaches forIoT and fog computingare discussed. There exist several publicationswhichpropose the use of fog computingtoprocess data in real-time and to addressthe aforementioned issuesofIoT applications. For


For example, Hussain et al. present an architecture for a smart parking system and discuss the suitability of fog computing for this scenario [81]. The architecture enables code execution on IoT devices instead of using the cloud for computational tasks. Fog devices are used to process and store local camera information about whether or not parking spots are free for each point in time. Furthermore, the collected data is used to predict empty parking spots in the near future. Data analytics on the fog nodes allows for efficient real-time processing with low latency compared to a typical cloud-based solution [81].

Pfandzelter et al. present basic processing paradigms suitable for IoT data processing: stream processing, batch processing and serverless functions. Depending on different IoT use cases (event processing or data analytics), the paper discusses where to run each use case (edge, fog or cloud computing) and describes which processing paradigm should be used [11]. They argue that it is a question of pricing whether to use fog/edge computing or the cloud. Cloud computing is cheaper for complex tasks on large datasets and therefore more suitable in this application area. On the contrary, network bandwidth could be expensive and limited. They propose that some preprocessing tasks like filtering, transformation, encoding, or aggregation could be done in the fog to save bandwidth and to use the cloud for further processing [11].

Yannuzzi et al. show that a smart combination of fog and cloud computing is the most plausible solution for building an adaptable and scalable platform for the IoT [82]. The authors argue that fog infrastructure is very suitable to process data, especially if devices move and a predictable latency is required.

Another very important factor to address high latency to the cloud is the placement of processing operators in the fog. Hiessl et al. found that the optimization of the placement of stream processing operators in the fog can be a very promising approach to improve the high latency of cloud computing [13]. The results show that the periodical optimal placement of stream processing operators in the fog reduces the response time and the cost, compared to optimizing the placement of operators only upon initialization.

Assunção et al. present an overview of conceptual and architectural approaches which aim to place stream processing tasks closer to where the data was generated, to address current fog-related issues [76, 83]. Yousefpour et al. give a comprehensive survey about fog computing and related edge computing paradigms [80]. The authors provide a detailed overview of current fog-related literature and categorize each research paper according to their defined taxonomy and objectives. They also provide an overview of applications for fog computing and mention that cloud-based data stream processing is not able to meet the needs of geographically distributed IoT systems [80].


Table 3.1: Objectives to evaluate distributed stream processing approaches for fog computing

Objective (abbr.): Description
Data processing (O1): Approaches are analyzed towards their data processing capabilities and according to the layers of the lambda architecture (speed, batch and serving layer).
Latency (O2): Approaches address high latencies to the cloud by using computational resources at the edge of the network.
Bandwidth (O3): Approaches address bandwidth limitations by using computational resources at the edge of the network.
Reliability (O4): Approaches discuss the reliability of the systems, including fault tolerance and how the systems respond to node or network failures.
Scalability (O5): Approaches discuss the scalability and elasticity of the systems (e.g., scalable across an arbitrary number of nodes, scalable across fog and cloud resources).
Heterogeneity (O6): Approaches address the heterogeneous landscape of fog computing infrastructure and do not assume a specific node type or network.
Security (O7): Approaches address security and privacy aspects of fog computing, such as authentication, authorization and data encryption.
Generality (O8): Approaches are analyzed towards their field of application (e.g., is the approach designed for a specific use case or is it generally applicable).
Project state (O9): Approaches are analyzed towards the current state of the project (discontinued, concept, research prototype or product).

3.1.1 Approaches and Current Research

Important approaches and state-of-the-art research for data stream processing at the edge of the network are discussed in the following paragraphs [80, 76, 84]. In addition, each approach is analyzed towards a predefined set of objectives, which have been partly adapted from the aforementioned survey presented by Yousefpour et al. [80]. Table 3.1 describes these objectives.

Apache Edgent

Formerly known as Apache Quarks, Apache Edgent is an open-source programming model for IoT data processing [85]. It offers a runtime for simple data analysis on edge devices (e.g., sensors, gateways, etc.) and APIs to use established centralized stream processing frameworks for complex tasks (partly addresses O6). Apache Edgent tries to reduce the amount of data sent to the cloud for further analysis by doing basic analysis on the device beforehand [85] (partly addresses O1). It is especially useful in the context of the IoT, where network traffic, bandwidth or other resources are limited (partly addresses O2 and addresses O3). According to the documentation [85], Apache Edgent is designed for specific use cases such as analyzing server logs or analyzing machine health on the edge (partly addresses O8). Furthermore, the documentation does not mention any functionality regarding scalability, reliability or security (does not address O4, O5 and O7). Although Apache Edgent was part of the Apache Incubator initiative, the project has retired and is no longer being actively developed1.

QoS-aware Scheduler for Apache Storm

In 2015, Cardellini et al. introduced a promising approach to use Apache Storm over a fog computing infrastructure [10]. As has already been discussed in previous sections, Apache Storm is designed to run on commodity hardware within a local cluster and is not very suitable for highly distributed environments. In order to allow Apache Storm to run in a geographically distributed and dynamic (fog) environment, the authors extended the Apache Storm architecture and implemented a custom distributed QoS-aware scheduler (addresses O1 and O8). Basically, they added three new components: the AdaptiveScheduler, the QoSMonitor and the WorkerMonitor (see Figure 3.1) [10]. The AdaptiveScheduler runs the QoS-aware scheduling algorithm and assigns tasks to the right worker nodes to reduce application latencies and traffic [10] (addresses O2 and O3). The monitoring components (QoSMonitor and WorkerMonitor) are used to gather information about the system such as resource utilization, availability or network information. Afterwards, the collected information is provided to the distributed scheduler. Furthermore, functionality regarding scalability and reliability is provided by Apache Storm, as the approach extends Storm (addresses O4 and O5). The results show that the distributed QoS-aware scheduler outperforms the default Storm scheduler, improves the application performance and enhances the system with runtime adaptation capabilities. Janßen et al. proposed a similar approach for Apache Flink [86].

VISP

Another noteworthy approach to mention is VISP (VIenna ecosystem for elastic Stream Processing) [87, 88]. VISP is an ecosystem for elastic data stream processing for the IoT, which has been developed at TU Wien. VISP is a holistic approach which supports the complete lifecycle of designing, deploying and executing complex processing topologies [87]. It addresses the specific needs for elastic data stream processing in the IoT, such as location-aware deployment and the integration of heterogeneous devices (addresses O1 and O6). Furthermore, VISP offers a marketplace, where users can upload custom operators and topologies. Topologies can be designed with a built-in graphical topology builder. The VISP Runtime offers an environment to execute topologies published on the VISP marketplace. The runtime can be deployed in the cloud or on computational resources at the edge of the network (e.g., cloudlets).

1https://github.com/apache/incubator-retired-edgent,Last accessed: 2021-01-29


Figure 3.1: Extended Apache Storm architecture (adapted from [10])

VISP also offers a hybrid deployment where different parts are deployed on different geographical locations (addresses O2 and O3). The VISP Runtime takes care of monitoring the system, detecting failures and scaling (e.g., deploying additional operators) if more computational resources are needed (addresses O4 and O5). In comparison to other IoT streaming platforms, VISP offers a runtime, a marketplace, a topology builder and on-premise deployment [87].

SpanEdge

Sajjad et al. propose a unifying stream processing approach over central and near-the-edge data centers [89]. The idea behind this approach is to distribute stream processing applications (addresses O1) across central data centers (first tier) and second tier data centers located in the fog (partly addresses O6). The SpanEdge programming environment allows programmers to specify where certain tasks of an application should be executed. This allows to run latency-sensitive parts close to the data sources and to reduce high latencies and bandwidth consumption (addresses O2 and O3). The same concept is used to aggregate results, where aggregation tasks are executed on ideally located data centers to reduce latencies, bandwidth usage and cost. Regarding the architecture, SpanEdge uses a master-worker architecture consisting of two different worker types: hub- and spoke-workers (see Figure 3.2). Hub-workers are located at central data centers and spoke-workers at data centers on the edge of the network. The SpanEdge scheduler on the master node makes sure that tasks are optimally distributed to hub- and spoke-workers. The authors implemented a prototype based on Apache Storm and used an emulated network environment to evaluate the prototype (addresses O8). As the prototype is based on Apache Storm, SpanEdge is scalable and provides fault tolerance and reliability guarantees (addresses O4 and O5). They showed that SpanEdge is able to optimally deploy distributed stream processing applications and thereby reduces latencies and bandwidth consumption.


Figure 3.2: SpanEdge architecture (adapted from [89])


GeeLytics

Cheng et al. propose GeeLytics, an edge analytics platform designed for large-scale geo-distributed IoT systems, which focuses on efficient real-time data processing at the edge of the network and in the cloud [90] (addresses O1). Compared to existing stream processing platforms, GeeLytics supports dynamic topologies which enable low-latency stream processing and minimize bandwidth consumption between the cloud and IoT devices (addresses O2 and O3). GeeLytics is designed as a PaaS (addresses O8), which considers the dynamic characteristics of a heterogeneous fog environment (partly addresses O6) and currently available system resources to instantiate stream processing tasks in the right place. This allows to process data close to where the data was generated or where the results are consumed, or in the cloud if needed. Another key feature is scalability: GeeLytics is designed to scale up to thousands of geographically distributed nodes over wide area networks [90] (addresses O5). GeeLytics has a master-worker architecture, where the Topology Master refers to the master and the IoT Agents represent the workers (see Figure 3.3). IoT agents deployed on computational resources in the cloud or in the fog execute tasks wrapped in Task Containers. For communication, a publish/subscribe mechanism is used. Furthermore, a Controller located in the cloud is used to manage the resources and components of the platform. GeeLytics is still in an early stage, facing unaddressed challenges such as task scheduling, resource orchestration, reliability and advanced security functionality (does not address O4 and partly addresses O7).

Frontier

O'Keeffe et al. propose Frontier, a resilient edge-based stream processing system for the IoT [91] (addresses O1). Frontier operates at the edge of the network and does not rely on a permanent connection to the cloud. It is designed for IoT devices within a wireless network with unreliable network conditions (addresses O4).


Figure 3.3: GeeLytics architecture (adapted from [90])

In Frontier, data-intensive IoT applications are expressed as continuous data-parallel streaming queries, which are translated into dataflow graphs. The core idea behind Frontier is to increase the number of possible network paths by adding replicas of operators. A high network path diversity for query processing results in higher throughput and a more resilient system in unreliable networks (addresses O2 and O3). Frontier dynamically chooses the best located operators within the network, based on processing capabilities, queue lengths, network conditions and a back-pressure stream routing algorithm (addresses O4, O6). The prototype of Frontier runs on Raspberry Pis, Android and Linux devices. However, the prototype implementation does not discuss any scalability or security functionality (does not address O5 and O7). An experimental evaluation on six Raspberry Pis shows that the authors were able to achieve significant speed-ups for the tested queries [91].

The NebulaStream Platform

Another approach to deal with fog-related challenges was introduced with the NebulaStream Platform [92]. The authors argue that current systems are not yet ready for the upcoming challenges of the IoT era. On the one hand, cloud-based stream processing frameworks like Apache Flink or Apache Spark do not make use of the full capabilities IoT devices offer. Typically, sensor data must be sent to a single server in the cloud, which results in a bottleneck for a large sensor network with millions of active devices. The authors show that the latency increases with the amount of sensors sending data [92]. On the other hand, fog-based solutions like Frontier [91] do not make use of cloud infrastructure and only scale within the fog. However, the authors state that IoT applications of the future should be able to exploit the capabilities of fog as well as cloud computing (sensor-fog-cloud environment), and as of today there does not exist such a system.


Zeuch et al. propose the NebulaStream (NES) platform to address the issues of current state-of-the-art solutions. NES advertises itself as a data management system for wireless sensor networks which provides a general-purpose query execution engine and exploits resources in the cloud and in the fog (addresses O1 and partly addresses O8). The authors mention three characteristics that are not supported by current data management systems: heterogeneity, unreliability and elasticity. The fog is a highly heterogeneous environment and current approaches like virtual machines do not allow to fully exploit the capabilities of the devices. Compared to other solutions, NES considers the hardware-related capabilities of each device and generates hardware-tailored code to efficiently utilize resources (addresses O6). Furthermore, hardware-tailored code helps to reduce power consumption, while achieving the same performance. Unreliability refers to the changing location, latency and throughput of IoT nodes (addresses O2 and O3). Current approaches for load balancing, fault tolerance and correctness do not consider the fog and cloud in combination. The same applies to elasticity, where current solutions do not support scaling resources across different environments (addresses O5).

NES uses a fine-grained recovery strategy and applies different failure recovery approaches on the sensor, fog and cloud layer (addresses O4). On the sensor layer, NES uses sensors in close proximity to back up faulty sensors. Like Frontier, NES makes use of multiple network paths to increase fault tolerance on the fog layer. On top of that, NES addresses security and privacy concerns by preprocessing data locally and transferring only authorized or anonymized data over public networks (partly addresses O7). Like the aforementioned approach, NES has a master-worker architecture (see Figure 3.4). The NES Topology Manager orchestrates the topology consisting of workers. NES supports dynamic topologies and each worker registers itself on startup. The NES Deployment Manager translates an execution plan from the NES Optimizer into a node execution plan and deploys it on the worker nodes. The NES Monitor gathers information about the system and provides it to the optimizer to improve the placement of operators within the topology. Although the NES platform is still under development, the authors provide some early results on specific aspects. Experiments show that NES significantly reduces the amount of data and sensor reads, provides low latency, increases throughput and decreases energy consumption [92].

FogGuru

Battulga et al. present a fog computing platform based on Apache Flink, called FogGuru [93]. FogGuru aims to make the deployment of applications on fog infrastructure easier and to improve latencies, bandwidth usage, security and energy consumption compared to cloud-based solutions. The platform uses a lightweight publish-subscribe messaging system (MQTT) to ingest IoT data into the system. A Docker Swarm cluster is used for easy application orchestration in a fog computing environment. On top of that, Apache Flink is used as the distributed stream processing framework (addresses O1 and O8). Although Docker Swarm as well as Apache Flink are scalable and provide fault tolerance, the authors do not adequately address this topic in their paper (partly addresses O4 and O5).


Figure 3.4: NES architecture (adapted from [92])

For demonstration purposes, the setup was deployed on a cluster consisting of five Raspberry Pis. The authors mention that in theory any device could be used as a fog node (partly addresses O6). Practical experiments showed that the prototype was able to process streams of data from IoT devices close to real-time. Unfortunately, Battulga et al. do not share any information regarding latency or bandwidth improvements. However, since FogGuru is designed to run at the edge of the network, it is likely that such a system positively addresses high latency and bandwidth utilization to the cloud (addresses O2 and O3).

3.1.2 Discussion

This section summarizes and discusses the findings of the aforementioned approaches from Section 3.1.1. Table 3.2 gives an overview of the findings according to the objectives defined in Table 3.1, indicating for each approach whether the authors address (resp. discuss) an objective fully, partly or not at all.

First of all, each of the aforementioned approaches is explicitly designed for stream processing and therefore only addresses the speed layer and not the batch or serving layer of the lambda architecture. As far as processing capabilities go, all approaches allow to process and analyze data at the edge of the network to address fog-related challenges such as high latency and bandwidth limitations to the cloud. For example, the QoS-aware scheduler for Apache Storm approach, SpanEdge and FogGuru are based on established state-of-the-art stream processing frameworks, while NES and Frontier provide a stream query model for wireless sensor networks.


Table 3.2: Evaluation of distributed stream processing approaches for fog computing

Approach: Project state
Apache Edgent [85]: Discontinued
QoS-aware Scheduler for Apache Storm [10]: Prototype
VISP [87]: Prototype
SpanEdge [89]: Prototype
GeeLytics [90]: Prototype
Frontier [91]: Prototype
NES [92]: Prototype
FogGuru [93]: Prototype

(The remaining columns of the original table rate each approach against the objectives of Table 3.1: the lambda architecture layers covered (speed, batch, serving), latency, bandwidth, reliability, scalability, heterogeneity, security and generality.)

Although the approaches based on Apache Storm and Apache Flink are great for distributed stream processing, they are not as suitable as other approaches for heterogeneous IoT devices with different computational capabilities. Besides that, Docker-based approaches like GeeLytics and FogGuru allow an easy integration of devices, as long as they are able to run Docker. The same applies to the reliability of the systems, except for Apache Edgent and GeeLytics, which do not discuss this topic. Especially NES and Frontier are designed for unreliable network conditions and make use of multiple network paths to increase fault tolerance. Frontier optimizes network paths by considering processing capabilities, queue lengths and network conditions, while the NebulaStream platform also generates hardware-tailored code to efficiently utilize resources. According to the publications, nearly all approaches are scalable across an arbitrary number of nodes. NES, VISP, SpanEdge and GeeLytics scale across fog and cloud environments to exploit the capabilities of fog as well as cloud computing. Although Frontier seems to provide some sort of scalability, the authors do not discuss scalability in their paper; the same applies to Apache Edgent. Basically, none of the approaches sufficiently addresses security, or they do not consider security at all. NES actively addresses security and privacy concerns by preprocessing data locally and transferring only authorized or anonymized data over public networks. GeeLytics authenticates devices when they try to join the network. Regarding the project state, all the aforementioned approaches, except for Apache Edgent, are research prototypes and, to the best of our knowledge, only VISP is open-source2 and accessible to the public to test and carry out further experiments.

2https://visp-streaming.github.io,Last accessed: 2021-01-29


3.1.3 Commercial Solutions

As has already been stated at the beginning of this chapter, fog computing has been proposed as a solution to address current issues of IoT applications. This trend has led well-known public cloud providers like Microsoft, Amazon or Google to expand their services for the IoT and to invest in further research and development [94, 95]. For example, with AWS IoT Greengrass, Amazon offers a service to extend AWS to edge devices [96]. This allows edge devices to filter and process data locally and to send only necessary data to the cloud. In the cloud, well-established stream processing frameworks can be used for further processing. Microsoft as well as Google offer similar solutions to process and analyze data on edge devices with Azure IoT Edge [97] and Google Cloud IoT [98].

Das et al. present some benchmarking results for AWS Greengrass and Microsoft Azure IoT Edge [99]. The authors tested end-to-end latency, bandwidth utilization, local resource utilization and infrastructure cost for audio-to-text decoding, image recognition machine learning and sensor processing applications. They found that the provided services for edge computing are a promising alternative to cloud computing for light computational workloads [99].

3.2 Lambda Architecture for Distributed Stream Processing

We have already talked about the idea behind the lambda architecture design pattern in Section 2.4. The lambda architecture consists of three layers: speed layer, serving layer and batch layer. The interplay of those three layers makes it possible to process massive volumes of historical batch data, while simultaneously using stream processing to have a real-time analysis of continuous data streams [79].

Kiran et al. present an implementation of the lambda architecture design pattern for cost-effective batch and speedy big data processing [16]. The authors found that public cloud solutions like Amazon Web Services (AWS) or Microsoft Azure provide a feasible backend for batch and real-time data processing for application scenarios like smart cities [16]. In particular, AWS allows for a cost-optimized implementation of the lambda architecture design pattern for batch and real-time data processing.

Martín et al. propose an edge computing architecture for the IoT [100] based on the λ-CoAP architecture [101]. The λ-CoAP architecture proposes the use of cloud computing to overcome limited processing, storage and networking capabilities for data generated by the IoT. A smart gateway is used to interconnect the lambda architecture running in the cloud and heterogeneous IoT devices connected over CoAP. The proposed λ-CoAP [101] edge computing architecture extends the smart gateway with basic processing capabilities (e.g., data filtering and data aggregation) to address latency and bandwidth limitations [100].


Figure 3.5: Lambda architecture for maintenance analysis (adapted from [100])

Yamato et al. propose a lambda architecture adaptation for real-time predictive maintenance [102]. The proposed architecture targets machine maintenance in smart factories, where IoT devices are used to collect and analyze sensor data. The authors argue that current solutions have three major problems when IoT platforms are used for manufacturing and maintenance applications: no sufficient real-time analysis on site, high cost to collect sensor data and send it to the cloud for analysis, and last but not least high cost to configure rules for failure detection [102]. To address those issues, Yamato et al. propose a maintenance platform using the lambda architecture design pattern. A machine learning framework for distributed stream processing (Jubatus3) is used to detect anomalies by analyzing sensor data at the edge (speed layer). Detected failures are communicated to maintenance applications running in the cloud. Furthermore, stored sensor data on the edge is sent to a database in the cloud for a detailed analysis (batch layer). Figure 3.5 shows the proposed architecture.

Darwish et al. present the usage of the lambda architecture for real-time big data analysis in the Internet of Vehicles (IoV) [103]. Intelligent Transportation Systems (ITS) become more and more data-intensive and there is an increasing demand to process data in real-time. Sending huge amounts of data from geo-distributed devices to the cloud for data analytics faces many issues such as network overhead, bandwidth limitations and high latencies.

3http://jubat.us/en/,Last accessed: 2021-01-29

For latency-sensitive applications like transportation systems, fog computing is considered a promising approach to address the aforementioned issues. However, fog computing alone is not sufficient due to computation and storage limitations. Both the cloud and the fog should be used in such an environment. In order to analyze historic and new data efficiently, the lambda architecture is used to avoid high costs and high latencies from long-running queries. The authors propose a real-time intelligent transportation system big data analytics (RITS-BDA) architecture, where the three different layers of the lambda architecture are distributed across the cloud and the fog [103].

Wang et al. present the Phi architecture, an architecture utilizing the lambda architecture design pattern for industrial big data processing in aviation manufacturing [104]. The authors propose the integration of a feedback loop to update algorithm parameters on edge computing nodes. Parts of the architecture (in particular Apache Kafka) are deployed on the edge of the network to address shortcomings of traditional data processing architectures such as high latencies to the cloud. Furthermore, a microservice approach is used to improve stability, flexibility and expandability of the system.

Kreps proposes to just use a streaming engine to do all the data processing [105]. The author argues that implementing the lambda architecture in distributed frameworks like Storm and Hadoop is very complex and there should be a framework that abstracts over both the real-time and batch framework. The higher-level framework compiles down to stream processing or MapReduce jobs. An example for such a framework is Summingbird4. Lin also discusses a similar approach [106].

Sanla et al. provide a comparative performance analysis between the lambda and kappa architecture [107]. The kappa architecture omits the batch layer and only consists of a streaming layer for real-time data processing and a serving layer. Experiments with a data size of 3 MB, 30 MB and 300 MB show that the lambda architecture outperforms the kappa architecture in the accuracy test, while using double the processing time [107].

3.3 Fog Computing Infrastructure

The evaluation and testing of IoT-based application prototypes can be very challenging [108]. Using the target environment for evaluation is often neither feasible nor practical. Setting up a target-like environment using real hardware can be very expensive and often requires a lot of domain knowledge [108]. As a result, a first evaluation is typically performed on a simulated or emulated IoT environment; further experiments are then executed on testbeds [80, 108]. Over the last years, the simulation of IoT, fog and edge environments has been under extensive research. As of today, there exists a large number of simulators, such as iFogSim [109], IOTSim [110], EdgeCloudSim [111] or Cooja [112]. Although such simulators are very useful to test specific parts of applications for the IoT, like network protocols or placement strategies, there is no all-in-one simulator for a detailed end-to-end evaluation [108].

4https://github.com/twitter/summingbird,Last accessed: 2021-01-29


Moreover, simulation results may not represent the results of a real environment and should be treated with caution [113].

Besides simulators, emulators allow to perform experiments with real applications [108]. There exists a number of frameworks which can be used to emulate fog computing environments, such as EmuFog [114], Fogbed [115] or MockFog [22]. Emulation frameworks are typically based on Docker, which emerged to be an attractive technology to emulate fog-like environments [116]. For example, EmuFog allows to run Docker applications on nodes in their topologies and to emulate the network between the nodes using MaxiNet5. While EmuFog is designed for local host emulation, Fogbed and MockFog support a distributed deployment. In particular, MockFog can be used to emulate fog computing infrastructures in the cloud (Amazon EC2 or OpenStack), e.g., by using multiple EC2 instances representing different IoT devices [22].

Another option to evaluate IoT-based applications is to use testbeds. Testbeds allow the evaluation of IoT applications in a real environment and allow for an in-depth analysis. Open testbeds like FIT IoT-LAB [117] or SmartSantander [118] emerged to provide users a platform to easily run experiments on real hardware without the overhead of setting up such environments themselves [108]. Although testbeds are a good option to evaluate IoT applications, they are usually very complex and designed for specific use cases, with each testbed offering different capabilities [108, 119]. FIT IoT-LAB consists of 2728 low-power wireless nodes and 117 mobile robots for large-scale wireless IoT experiments [117]. SmartSantander allows to evaluate IoT applications for the smart city domain. It consists of a heterogeneous set of IoT devices deployed all over the city of Santander in Spain [118].

3.4 Conclusion

With the increase of data velocity and data volume, centralized solutions are no longer sufficient for large-scale IoT applications. As of today, there exist many approaches that propose fog computing, and especially data analytics at the edge of the network, as a solution to address current challenges and issues of IoT-based applications [80]. However, there is no established consumer-ready distributed stream processing framework for fog computing.

Section 3.1.2 already discusses current approaches in detail. Probably the most promising approach for wireless sensor networks has been introduced with the NebulaStream platform [92]. Battulga et al. present an approach which uses a state-of-the-art distributed stream processing framework (Apache Flink) for data analytics at the edge of the network [93]. As this work focuses on the implementation of the lambda architecture over fog computing infrastructure, rather than implementing the next stream processing framework for the fog, this might be an interesting approach to follow further.

5https://maxinet.github.io/,Last accessed: 2021-01-29


To analyze historic and new data efficiently, the lambda architecture design pattern has been introduced. Although the field of big data is under extensive research, there is, to the best of our knowledge, very little active research regarding the lambda architecture design pattern over fog computing infrastructures. A noteworthy approach was introduced by Darwish et al., who presented a concept of using the lambda architecture for real-time big data analysis in the transportation domain [103]. However, there is no general implementation of the lambda architecture for distributed stream processing in the fog.


CHAPTER 4

Requirements Analysis and Design

After the discussion of the background in Chapter 2 and of the related work in Chapter 3, this chapter describes the required functionality to address the problem statement of Chapter 1 and subsequently presents the architecture of the solution approach.

4.1 Requirements

The introduction as well as the motivational scenario already discussed that current state-of-the-art stream processing solutions are not suitable to process data in real-time in the context of the IoT, while also efficiently making use of historical data. Chapter 3 concludes that there is the need to process data in close proximity to where the data was generated, in order to address latency and bandwidth limitations of current data stream processing approaches. However, Chapter 3 also states that there is no established consumer-ready distributed stream processing framework for fog computing and, especially, there is no general implementation regarding the lambda architecture for distributed stream processing in the fog. Although there is a lot of current research regarding distributed stream processing on the edge of the network, none of the approaches which have been discussed in Chapter 3 takes historical data into account. In order to implement the lambda architecture for distributed stream processing in the fog, the following functional as well as non-functional requirements have been defined.


4.1.1 Functional Requirements

• Data processing

– Stream processing
The system should be able to process data close to real-time. If possible, established distributed stream processing frameworks such as Apache Flink should be used (see Section 2.3.1).

– Batch processing
Besides stream processing, the system should be able to process huge amounts of historical data. For example, the MapReduce programming model in combination with Apache Hadoop is a prominent example for this purpose.

– Operators
The system has to provide state-of-the-art stream processing functionality and has to support standard stream processing operators, like mapping, filtering, joining, aggregating or splitting data streams.

• Latency and bandwidth
Another important requirement for real-time distributed stream processing for IoT sensor networks is to reduce the latency and bandwidth usage to the cloud. Providing low latency is crucial in various use cases, such as smart manufacturing or transportation (see Section 2.1.3).

• Data storage
The system should be able to store recent and historical data. Usually, all the data is stored in the cloud, where a lot of cheap storage is available. Using the cloud to store historical data is perfectly fine, as data accumulated over several years requires a lot of space. However, recent data should be stored as close as possible to where the data was generated to address latency and bandwidth limitations.

4.1.2 Non-Functional Requirements

• Lambda architecture
In order to take advantage of both batch and stream processing, the system should implement the lambda architecture design pattern. This architectural style allows to process massive volumes of historical data, while simultaneously using stream processing to provide a real-time analysis of continuous data streams. Furthermore, the system should efficiently make use of the batch, serving and speed layers.

– Batch layer
The batch layer consists of a storage solution which stores all the data (master dataset). The storage solution should be scalable, fault-tolerant and able to efficiently append new data while providing immutability of existing data.


Furthermore, the batch layer should be able to generate views on the master data. As the master data usually consists of huge amounts of data, the batch layer should be able to process those batch views in parallel over multiple nodes.

– Serving layer
The serving layer stores the precomputed views from the batch layer and must efficiently replace old batch views with new ones (batch writeable). Furthermore, the database for the serving layer should allow random reads. Random writes are not required, as the data store only replaces full batches from the batch layer. Like the storage solution for the batch layer, the database for the serving layer should be scalable and fault-tolerant.

– Speed layer
The speed layer holds the most recent data (real-time view) and is continuously updated with new data. The database for this layer should support random reads and random writes, scalability and fault tolerance. Furthermore, the storage solution should be located on the edge of the network in close proximity to the data sources to address latency and bandwidth limitations.

• Fog computing
In the context of the IoT, data is usually generated by IoT devices located on the edge of the network. To process IoT data on a large scale, there is the need to push the processing of data closer to the edge of the network, where the data is generated and stored. Fog computing should therefore be used to address limitations of cloud computing such as high latencies and bandwidth usage. However, the system should not only focus on infrastructure located on the edge of the network; it should also make use of resources located in the cloud when needed.

• Reliability
It has already been mentioned that the storage solution for the speed layer should be fault-tolerant, as the speed layer operates in a dynamic environment on the edge of the network, where resources are not as reliable as, for example, resources operating in the cloud. However, this attribute also applies to the system in general, but especially to those parts which run in the fog.

• Scalability
To process large amounts of data, it is necessary that the system is scalable across a large number of nodes, in the cloud as well as in the fog.

• Heterogeneity
The solution approach should address the heterogeneous landscape of fog computing infrastructure and should not assume a specific type of node or network.


4.2 Architecture

The main goal of the architecture is to process data in close proximity to the data sources to avoid high latency and bandwidth usage to the cloud. Instead of sending all the data to the cloud, processing it there and sending the results back, the idea is to process data streams over fog computing infrastructure in a distributed way over several fog nodes. Furthermore, the architecture implements the lambda architecture design pattern to process massive volumes of historical data, while simultaneously using stream processing to provide a real-time analysis of continuous data streams. There are a number of ways to build such an architecture. Section 3.1.1 already discussed what such approaches can look like. However, none of the approaches considers the lambda architecture to efficiently make use of historical data. In the following sections, important design decisions regarding how data is processed and the lambda architecture are discussed. Section 4.2.3 and Figure 4.1 summarize the intended architecture.

4.2.1 Distributed Stream Processing

As already mentioned, state-of-the-art stream processing approaches are usually deployed in the cloud. This approach is good for many use cases which do not rely on low latency and where bandwidth consumption is not an issue. However, for application fields like smart manufacturing or transportation (e.g., autonomous driving), it is crucial to provide low latency. To overcome these shortcomings, the intended architecture processes data streams in the cloud as well as in the fog. Stream processing in the cloud is used to process data and store the data for later usage (e.g., batch processing over historical data). In addition to stream processing in the cloud, data streams are also processed in close proximity to the data sources to support use cases that require low latency. The conclusion of the last chapter (Section 3.4) already stated that this work focuses on the implementation of the lambda architecture over fog computing infrastructure, rather than implementing the next stream processing framework explicitly designed for the fog. Furthermore, recent research shows that using state-of-the-art stream processing frameworks over fog computing infrastructure is a promising approach to follow.

Stream Processing Frameworks

Section 2.3.1 already described current state-of-the-art distributed stream processing frameworks. According to the literature, the most well-known, discussed and compared stream processing frameworks are Apache Storm, Apache Spark and Apache Flink [21, 120, 121]. Probably the most promising candidate for stream processing over fog computing infrastructure is Apache Flink, which has been chosen for a couple of reasons:

• Flink is open-source, scalable, fault-tolerant, supports a high throughput and can be easily integrated into the Apache Hadoop ecosystem.


• Flink is very suitable for latency-critical applications, as it supports real-time stream processing out-of-the-box (e.g., compared to Apache Spark, which uses micro-batches to process data streams).

• Flink provides a rich set of common stream processing operators [122, 123] (a minimal usage sketch follows this list).

• Flink has been tested on devices with rather low computational resources. Battulga et al. successfully deployed a Flink topology on a cluster consisting of Raspberry Pis [93].

• Flink does not necessarily rely on third-party components (Apache Storm, for example, depends on Apache ZooKeeper) and is therefore more lightweight. For example, Cheng et al. mention in their paper that, according to a report by Yahoo, Apache ZooKeeper is not suitable for a large Apache Storm cluster with over 1000 nodes due to a bottleneck of ZooKeeper [90]. However, Apache Flink can be configured in high-availability mode1, which requires Apache ZooKeeper or Kubernetes.
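To illustrate the operator set mentioned above, the following minimal sketch shows how a fog-located Flink job could parse, filter and aggregate a stream of sensor readings with the DataStream API. It is only an illustrative sketch: the socket source, the "sensorId,temperature" input format and all names are hypothetical and not part of a concrete implementation.

```java
// Minimal Flink DataStream sketch (hypothetical input format and socket source).
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorAggregationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)                    // raw "sensorId,temperature" lines
           .map(line -> {                                           // map: parse the raw line
               String[] parts = line.split(",");
               return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
           })
           .returns(org.apache.flink.api.common.typeinfo.Types.TUPLE(
               org.apache.flink.api.common.typeinfo.Types.STRING,
               org.apache.flink.api.common.typeinfo.Types.DOUBLE))
           .filter(reading -> reading.f1 > -50 && reading.f1 < 100) // filter: drop implausible values
           .keyBy(reading -> reading.f0)                            // split the stream by sensor id
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .max(1)                                                  // aggregate: maximum per window
           .print();                                                // sink (stdout for the sketch)

        env.execute("Sensor aggregation sketch");
    }
}
```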

Looking at the pure numbers of some benchmarks, Apache Flink is able to outperform Apache Storm and Apache Spark on benchmarks such as the Yahoo Streaming Benchmarks (YSB) [120]. However, Spark was able to outperform both in terms of throughput with the usage of longer batching intervals (micro-batches). Unfortunately, increasing the batch size also significantly increases the latency, which is not suitable for latency-critical applications [120, 121].

Furthermore, Janßen et al. propose a scheduler for Apache Flink to address the geographically distributed, heterogeneous resources of the fog [86]. The implemented prototype shows that heuristic algorithms which consider QoS metrics such as bandwidth, latency, and node capacities are significantly better at scheduling tasks across heterogeneous network topologies. Although the possible improvements strongly depend on the network topology, such a scheduler could also be implemented for this work in the future. A similar approach has been introduced by Cardellini [10], which has already been discussed in Section 3.1.1. The QoS-aware Scheduler for Apache Storm allows to efficiently use Apache Storm in distributed environments. The prototype is based on Apache Storm v0.9.3, which has been released on Nov 25, 2014. Since then, “an immense number of new features, usability and performance improvements have been implemented” in Storm, according to the release notes2. The latest version of Storm (2.2.0) has been released on Jun 30, 2020. Although the source code of the QoS-aware Scheduler for Apache Storm is available on GitHub3 and it would be possible to implement the adapted scheduler in the newest version of Storm, it may not be feasible since a similar approach has been proposed for Flink and Flink provides better performance and usability. Furthermore, implementing a QoS-aware scheduler is a rather complex and time-consuming task.

1https://ci.apache.org/projects/flink/flink-docs-stable/deployment/ha/, Last accessed: 2021-01-29
2https://storm.apache.org/2020/06/30/storm220-released.html, Last accessed: 2021-01-29
3https://github.com/matnar/uniroma2-storm, Last accessed: 2021-01-29


Message Broker

Typically, IoT devices (e.g., sensors) publish their data to a message broker located in the cloud, which unifies and aggregates the data before it is processed. Deploying such a message broker on fog computing infrastructure allows the fog-located stream processing topology to benefit from low latencies, as the data is not sent to the cloud first. The cloud-located stream processing topology consumes data from the message broker in the fog. As the message broker is deployed in the fog and the cloud- and fog-located stream processing topologies consume the data from there, the message broker should be able to scale across a number of fog nodes to improve fault tolerance in case of a node failure.

A distributed event streaming platform which meets those requirements is Apache Kafka [124]. Apache Kafka offers high throughput, scalability and high availability. This is achieved by replicating topics across multiple brokers to ensure the data is not lost in case something goes wrong. Although Apache Kafka has originally been designed for the cloud, Kafka can also be used on fog devices [125]. Additionally, Apache Kafka integrates nicely with Apache Flink.

For machine-to-machine communication between devices with very limited computational resources, the MQTT4 protocol is a widely-used option. MQTT is a lightweight publish/subscribe messaging protocol that has been developed for connecting remote devices with minimal network bandwidth. There exist many message broker implementations which support MQTT, such as RabbitMQ5, ActiveMQ6, HiveMQ7 or Eclipse Mosquitto8. Especially RabbitMQ and HiveMQ are widely-used message brokers for intermachine communication [126]. Like Kafka, both can be scaled across several nodes to provide high availability by replicating data (e.g., exchanges). Furthermore, Apache Flink offers an out-of-the-box connector to consume data from RabbitMQ.

However, an MQTT message broker does not necessarily replace Kafka and vice versa [127]. There are several use cases which work best with both types of message brokers. For example, Kafka is the most suitable option if data streams need long-term message storage or consumers need to replay messages (e.g., in case of a node failure). Furthermore, Kafka could be used if several MQTT message brokers are deployed in geo-distributed fog environments which aggregate the sensor data for a specific area and report the data to Kafka for stream processing. In such a scenario, an MQTT message broker can act as an IoT gateway to consume MQTT messages and subsequently write the messages to Kafka [128]. To do so, the Confluent Platform offers an MQTT connector9 to consume MQTT messages from an MQTT broker and an MQTT proxy10 to directly consume MQTT messages.

4http://mqtt.org/, Last accessed: 2021-01-29
5https://www.rabbitmq.com/, Last accessed: 2021-01-29
6http://activemq.apache.org/, Last accessed: 2021-01-29
7https://www.hivemq.com/hivemq/, Last accessed: 2021-01-29
8https://mosquitto.org/, Last accessed: 2021-01-29
9https://docs.confluent.io/kafka-connect-mqtt/current/index.html, Last accessed: 2021-01-29
10https://docs.confluent.io/platform/current/kafka-mqtt/index.html, Last accessed: 2021-01-29


Kafka also integrates nicely with Apache Hadoop. For example, the Confluent Platform offers an HDFS connector11 for Kafka to export topics directly to HDFS. This is a useful option if Kafka is used as a message broker for stream as well as batch processing (e.g., suitable for implementing the lambda architecture). Additionally, Kafka can be configured to send data in batches to increase throughput.

In the end, the choice of the message broker strongly depends on the specific use case and cannot be answered in general. Kafka is the first choice for stream processing and offers the best integration with third-party stream and batch processing frameworks, while MQTT brokers like RabbitMQ or HiveMQ are better suited to consume sensor data from a large number of nodes [129]. For the sake of simplicity within the scope of this thesis, a separate MQTT broker will not be used for sensor data, but can be added in the future if needed.
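As an illustration of how the fog-located topology could ingest sensor data from such a broker, the following hedged sketch wires a Flink job to a Kafka topic. Topic name, broker addresses and consumer group are placeholder assumptions, and the FlinkKafkaConsumer connector shown here corresponds to the pre-1.14 style connector API and would have to match the Flink version actually used.

```java
// Sketch: consuming a (hypothetical) "sensor-readings" Kafka topic from a fog-located Flink job.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaIngestionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "fog-kafka-1:9092,fog-kafka-2:9092"); // placeholder brokers
        props.setProperty("group.id", "fog-speed-layer");                            // placeholder group

        FlinkKafkaConsumer<String> source =
                new FlinkKafkaConsumer<>("sensor-readings", new SimpleStringSchema(), props);
        source.setStartFromLatest(); // only the most recent data is relevant for the speed layer

        DataStream<String> readings = env.addSource(source);
        readings.print(); // a real job would apply the operators sketched above instead

        env.execute("Kafka ingestion sketch");
    }
}
```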

4.2.2 Lambda Architecture

As the name implies, the lambda architecture is an architecture which defines how data is processed and stored. As a result, the architecture is tightly coupled with a distributed stream processing framework and storage solutions like HDFS. Most importantly, the lambda architecture is usually implemented in a data center located in the cloud [15]. However, to address the latency and bandwidth requirements, it is necessary to deploy the speed layer in close proximity to the data sources. Compared to a traditional implementation of the lambda architecture, the intended architecture uses a fault-tolerant distributed database across fog computing infrastructure to store the most recent data (speed layer). The most important requirements for each layer of the lambda architecture have already been defined at the beginning of this chapter. In the following sections, suitable solution candidates for each layer are discussed.

Batch Layer

The requirements for the batch layer are very loose and a lot of storage solutions fulfill such requirements. However, the simplest and most suitable storage solutions for the batch layer are file systems. File systems are able to efficiently append new data, as data is stored sequentially on disk. Furthermore, file systems enforce the immutability of data by setting proper file permissions. Unfortunately, traditional file systems are neither scalable nor fault-tolerant. Luckily, distributed file systems meet these requirements. Distributed file systems are very similar to traditional file systems, with the difference that distributed file systems store the data across a cluster of nodes. In the case of a node failure, the data would still be accessible. A distributed file system such as HDFS is very well suited to store the master dataset [15].

As the requirements for the batch layer are that loose, in theory any fault-tolerant distributed data store could be used as the storage solution for the batch layer. However, alternatives to distributed file systems often offer many things the batch layer does not require and therefore unnecessarily increase complexity [15].

11https://docs.confluent.io/kafka-connect-hdfs/current/index.html,Last accessed: 2021-01-29


Compared to HDFS, public cloud providers offer more elastic and therefore cheaper-to-operate alternatives such as Amazon's S3, Microsoft's Azure Blob Storage, and Google's Cloud Storage.
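To make the role of the distributed file system more concrete, the following sketch appends a new sensor record to a hypothetical master dataset file on HDFS via the Hadoop FileSystem API; the namenode address, paths and record layout are assumptions for illustration only, and append support must be enabled on the cluster.

```java
// Sketch: appending a new record to the (hypothetical) master dataset on HDFS.
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MasterDatasetWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // placeholder namenode

        Path file = new Path("/lambda/master/sensor-readings.csv"); // placeholder path
        // Create the file on first use, append afterwards (requires append support on the cluster).
        try (FSDataOutputStream out = fs.exists(file) ? fs.append(file) : fs.create(file)) {
            out.write("sensor-1,2021-02-02T12:00:00Z,21.5\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```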

Besides storing the master dataset, the batch layer generates batch views for specific predefined queries. This way, it is not necessary to process all the data every time a predefined query has to be resolved. In the Hadoop ecosystem, there are a couple of ways how a batch view can be generated. The approach strongly depends on the use case and the used framework (e.g., using MapReduce, Spark, Flink). However, if the master dataset is stored on HDFS, the MapReduce implementation of Hadoop is one possible way to generate a batch view. To avoid implementing MapReduce jobs manually, a framework for data warehousing on top of Hadoop such as Apache Hive can be used [130]. Hive provides a declarative SQL-like query language named HiveQL. HiveQL queries are translated into MapReduce jobs. The results of Hive queries can be stored as batch views in the storage solution of the serving layer.
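A hedged sketch of how such a batch view could be recomputed is shown below: a HiveQL aggregation is submitted over Hive's JDBC interface and its result overwrites a view table. The table and column names as well as the HiveServer2 address are assumptions made purely for illustration.

```java
// Sketch: recomputing a (hypothetical) batch view with HiveQL via the Hive JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchViewJob {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "", ""); // placeholder HiveServer2
             Statement stmt = conn.createStatement()) {

            // Overwrite the batch view with a fresh aggregation over the master dataset.
            stmt.execute(
                "INSERT OVERWRITE TABLE sensor_daily_avg " +
                "SELECT sensor_id, to_date(event_time) AS day, avg(temperature) AS avg_temp " +
                "FROM master_dataset " +
                "GROUP BY sensor_id, to_date(event_time)");
        }
    }
}
```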

Besides Hadoop, there exist other data processing frameworks for batch processing. Two of the stream processing frameworks described in Section 2.3 can also be used for batch processing: Apache Spark and Apache Flink. Both are prominent alternatives to MapReduce and for big data analysis in general. Compared to MapReduce, both alternatives are rather new frameworks and provide better usability. However, it strongly depends on the requirements, circumstances and third-party libraries which framework should be used. Another important point which should not be disregarded is the type of data which is processed and analyzed, as the performance can vary between different data sets and frameworks. However, according to recent research, Spark and Flink are able to execute jobs faster than MapReduce [131].

Serving Layer

The serving layer stores the precomputed views from the batch layer. One storage solution which meets the requirements for the serving layer is Apache HBase [132]. HBase is a fault-tolerant and highly scalable, column-oriented data store built on top of HDFS, which supports random access (read and write). Pure HDFS is not as suitable as HBase, as in HDFS only full blocks can be read, which are 128 MB per default. HBase is not the only suitable candidate, but it is nicely integrated with Apache Hive and the Hadoop ecosystem, which makes Apache HBase a good fit as the storage solution for the serving layer.
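For illustration, the sketch below performs the kind of random read the serving layer needs, fetching a precomputed value from an HBase table with the standard Java client; the table name, column family, row key layout and ZooKeeper address are hypothetical.

```java
// Sketch: random read of a (hypothetical) batch view row from HBase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ServingLayerReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hbase-zk"); // placeholder ZooKeeper quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table view = connection.getTable(TableName.valueOf("sensor_daily_avg"))) {

            Get get = new Get(Bytes.toBytes("sensor-1#2021-02-02")); // placeholder row key
            Result result = view.get(get);
            byte[] value = result.getValue(Bytes.toBytes("v"), Bytes.toBytes("avg_temp"));
            System.out.println("Average temperature: " + Bytes.toDouble(value));
        }
    }
}
```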

In the next section, possible storage solutions for the speed layer are discussed, which are technically also suitable candidates for the serving layer. However, as the serving layer is tightly coupled with the batch layer, it makes a lot of sense to use related storage solutions like HDFS and HBase.


Speed Layer

The speed layer holds the most recent data (real-time view), is continuously updated with new data and usually operates, like the batch and serving layers, in the cloud. To meet the requirements of this work, the storage solution should be located on the edge of the network in close proximity to the data sources in order to address latency and bandwidth limitations. Furthermore, the storage solution has to be fault-tolerant and scalable over multiple nodes, as processing and storage capabilities of a single fog node are usually limited and nodes on the edge are more error-prone and not as fail-safe as nodes running in the cloud.

Apache Cassandra is a decentralized structured storage system which meets the requirements stated at the beginning of this chapter [133]. It is designed to run on very large clusters with commodity hardware, where data is automatically replicated across the nodes within a cluster. On top of that, Cassandra features a masterless architecture with no single point of failure and an SQL-like query language called CQL. Although Cassandra is not intended to run on small servers in production, it is possible to run Cassandra on devices with limited computational resources such as Raspberry Pis [134].

Many other scalable and fault-tolerant data stores with random access capabilities are also valid and possible candidates for the speed layer, such as MongoDB, CouchDB or Redis. For example, MongoDB can be set up as a cluster like Cassandra, but does not feature a masterless architecture and is therefore not as fault-tolerant as Cassandra (e.g., if the MongoDB primary node is unavailable, write operations can not be processed until a new primary node has been elected). Another interesting approach, which could be implemented in the future, has been proposed by Mayer et al., called FogStore [135]. FogStore tries to optimize distributed data stores originally developed for the cloud by providing a fog-aware replica placement and context-sensitive differential consistency.
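The sketch below shows how the fog-located part of the topology could maintain the real-time view in Cassandra: one CQL insert per incoming reading and one random read over the most recent values. Keyspace, table, schema and the contact point are illustrative assumptions; the code uses the DataStax Java driver (3.x API) and assumes a table clustered by event time.

```java
// Sketch: maintaining a (hypothetical) real-time view in Cassandra on fog nodes.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SpeedLayerStore {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("fog-cassandra-1")          // placeholder fog node
                .build();
             Session session = cluster.connect("speed_layer")) { // placeholder keyspace

            // Random write: store the latest reading of a sensor.
            session.execute(
                "INSERT INTO recent_readings (sensor_id, event_time, temperature) " +
                "VALUES ('sensor-1', toTimestamp(now()), 21.5)");

            // Random read: fetch the most recent readings of that sensor.
            ResultSet rs = session.execute(
                "SELECT event_time, temperature FROM recent_readings " +
                "WHERE sensor_id = 'sensor-1' LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("event_time") + " -> " + row.getDouble("temperature"));
            }
        }
    }
}
```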

4.2.3 Summary

Figure 4.1 shows the architecture of the solution approach utilizing the lambda architecture design pattern. For batch processing and data analytics, the Apache Hadoop stack is a perfect fit. The master data will be stored on HDFS (batch layer of the lambda architecture) and MapReduce in combination with Hive will be used to create batch views. MapReduce has been chosen as it is already included in the Apache Hadoop stack. However, using Apache Flink or Apache Spark for batch analytics is just as valid as using plain old MapReduce. Batch views will be stored on HBase (serving layer of the lambda architecture), which is a great fit if HDFS is used.

Stream processing will be used in the cloud as well as in the fog to avoid high latency to the cloud for latency-sensitive applications. Therefore, Apache Flink will be used in the cloud and the fog, as Flink supports real-time stream processing and provides a rich set of pre-implemented operators.


Figure 4.1: Architecture of the solution approach

Stream processing in the cloud will be used to provide more flexibility if data needs to be adapted before storing it to HDFS. In the fog, Apache Flink will be used to process, analyze and store incoming data streams in real-time. As the storage solution for the speed layer in the fog, Apache Cassandra will be used. For this work, sensor devices on the edge of the network will directly communicate with Apache Kafka. However, an (additional) MQTT broker could be used in the future if sensor devices or the network require the usage of MQTT.


This solution approach shows one way to implement the lambda architecture over fog computing infrastructure. Depending on the use case and further requirements, alternative frameworks and data stores may also be suitable. The next chapter describes how the services can be efficiently deployed in the cloud and the fog, as well as the technical implementation of the lambda architecture.


CHAPTER 5

Implementation

Chapter 4 already specified the general architecture to address the problem statement. This chapter discusses how such an architecture can be implemented from a more technical point of view. First, the hardware and software configuration of fog and cloud infrastructure are discussed. Next, it is described how fog nodes can be efficiently utilized and orchestrated. The following sections present the implementation of the lambda architecture as well as the implementation of non-functional requirements. Last but not least, limitations of the implemented solution approach are discussed.

5.1 Infrastructure Setup

5.1.1 Fog Infrastructure

Setting up a target-like environment using the exact hardware can be very challenging, expensive and often requires a lot of domain knowledge. Thus, a first evaluation is typically performed on a simulated or emulated IoT environment (see Section 3.3). Unfortunately, simulation frameworks are not very suited for end-to-end evaluation of a whole system. Emulating a fog computing infrastructure is the better option when it comes to end-to-end testing, as already discussed in Chapter 4. In particular, MockFog [22] can be used to emulate fog computing infrastructures on public cloud providers like AWS. MockFog uses Ansible¹ scripts to bootstrap AWS EC2 instances and connects the instances by a private network. Furthermore, MockFog deploys a node agent on each instance to manipulate the network to simulate real network conditions of fog devices. On top of that, MockFog allows deploying applications as Docker containers.

¹ Ansible is a tool for application deployment which enables infrastructure as code (IaC).


Table 5.1: Hardware configuration of emulated fog infrastructure

EC2 Instance                t4g.small                   t4g.micro
Architecture                ARM                         ARM
Operating system            Amazon Linux 2              Amazon Linux 2
Kernel                      4.14.173-137.229            4.14.173-137.229
CPU                         2 vCPUs                     2 vCPUs
Baseline Performance/vCPU   20%                         10%
Memory                      2 GiB                       1 GiB
Storage                     EBS General Purpose (SSD)   EBS General Purpose (SSD)
Network                     up to 5 Gbps                up to 5 Gbps

Hardware Configuration

To host a target-like environment, multiple EC2 t4g.small instances have been used, which represent different IoT devices with limited computational capabilities (e.g., a Raspberry Pi 3/4). Edge devices like sensors have been emulated on t4g.micro instances. A t4g.small instance offers 2 vCPUs, 2 GiB memory and up to 5 Gbps network bandwidth, while a t4g.micro offers only 1 GiB memory.

Besides the t4g instance type, AWS also offers the a1 and the m6g ARM-based general purpose instance types. The AWS t4g instance type has been chosen, as it is the cheapest option and, together with the a1 instance type, the most comparable option to a Raspberry Pi. AWS t4g instances use a custom-built 64-bit ARM AWS Graviton2 processor. However, the vCPU baseline performance is much lower, as shown in Table 5.1. AWS t4g instances are burstable instances and accumulate CPU credits over time when operating below the baseline. A t4g.micro instance earns 12 credits per hour, a t4g.small instance earns 24 credits per hour. One credit allows the instance to fully utilize a vCPU core for 1 minute.
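The baseline figures in Table 5.1 follow directly from these credit earn rates: one credit corresponds to one minute of full vCPU utilization, so

\[
\text{baseline per vCPU} = \frac{\text{credits per hour}}{60\,\text{min} \times \text{number of vCPUs}},
\]

which gives \(24 / (60 \cdot 2) = 20\%\) for a t4g.small and \(12 / (60 \cdot 2) = 10\%\) for a t4g.micro.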

Software Configuration

The solution approach uses containers to deploy the required software and tools on fog nodes. Therefore, a container runtime must be installed on each fog node. This proof-of-concept relies on Docker (19.03.6-ce) as the container runtime (see Section 5.2.2). Other tools and frameworks required for the setup process of the fog computing infrastructure are automatically installed by MockFog with Ansible scripts. For example, MockFog uses the Linux package iproute-tc and the Python tool tcconfig to manipulate network settings such as bandwidth, latency and packet loss to simulate real network conditions of the fog.

x64 vs. ARM

A crucial building block for fog computing is the support of the ARM architecture platform. The fog mostly consists of IoT devices with limited computational resources such as sensors and other limited devices (e.g., Raspberry Pis, mobile phones or embedded systems). While typical cloud infrastructure and personal computers are based on the x86-64 architecture, devices in the fog are often based on ARM. Especially battery-powered devices often feature ARM-based processors, as they are more power-efficient compared to x86-based systems (e.g., mobile phones). As a result, the speed layer of the lambda architecture should be able to run on ARM-based systems.

5.1.2 Cloud Infrastructure

Nowadays, many public cloud providers exist which provide cloud big data platforms for data analytics. Amazon EMR (Elastic MapReduce²) is one such platform. Amazon EMR offers an easy-to-use user interface to create a cluster with pre-configured software and frameworks for data processing.

Hardware Configuration

To host the cloud infrastructure, Amazon EMR offers several different instance types to choose from. However, the choice strongly depends on the requirements, the workload and how fast batch jobs need to be executed. For example, Kaplunovich et al. achieved the best performance for their use case with compute-optimized instances [136]. The authors also mentioned that choosing the right instance types requires expertise and monitoring. Kiran et al. explicitly mention the usage of storage-optimized instances for data processing, such as the i2.xlarge and d2.8xlarge [16]. In other words, there is no instance type which works best for all use cases. According to AWS, compute-optimized instances are the preferable instance type for batch processing workloads. For this work, a cluster of three c5.4xlarge compute-optimized instances (1 master and 2 core nodes) has been used. A c5.4xlarge instance offers 16 vCPUs, 32 GiB memory and 256 GiB of GP2 EBS storage per instance by default. Furthermore, the cluster has been configured to scale out automatically based on the workload (with a minimum of two core/task nodes). However, as this part will not be benchmarked in the evaluation, the chosen hardware is not critical at this point.

Software Configuration

Besides providing the hardware to host a data analysis backend, AWS also offers a range of frameworks and tools for big data processing. The installation and basic configuration of these frameworks and tools is automatically done by Amazon when the cluster launches. Table 5.2 shows the software which has been installed on the aforementioned EMR cluster.

² https://aws.amazon.com/en/emr/, Last accessed: 2021-01-29


Table 5.2: Software configuration of cloud infrastructure

Software    Version   Description
Hadoop      2.10.0    Framework for distributed processing of large data sets
HBase       1.4.13    Distributed, scalable database on top of HDFS
ZooKeeper   3.4.14    Framework to manage nodes within an Apache Hadoop cluster
Flink       1.11.0    Framework for distributed stream processing
Hive        2.3.7     Framework for data warehousing on top of Apache Hadoop
Hue         4.7.1     Dashboard for data warehouses
Oozie       5.2.0     Workflow scheduler for Apache Hadoop

5.2 Development Operations

Chapter 4 already specified the general architecture to address the problem statement. However, it has not been specified how fog nodes can be efficiently utilized and orchestrated.

5.2.1 Containerization

The way software is built, shipped and maintained has changed a lot in recent years. In particular, containerization has become a major trend in software development. Containerization allows packaging software, configuration, and all related artifacts (like dependencies etc.) into one place to run the same piece of software uniformly on any infrastructure, without the need to set up the same environment on each resource. Using containers is especially useful in dynamic environments where infrastructure needs to be horizontally scalable (e.g., adding or removing machines to the pool of resources). The intended architecture presented in Chapter 4 heavily relies on containers, as the orchestration and deployment of containers on (newly added) resources is much easier, faster and not as error-prone as compared to installing software and all the dependencies manually.

Docker [137] is one of the most widely used tools to create and manage containers. Docker offers an engine to run containers on any device on which the Docker daemon is installed. Containers based on Docker have been used to process data streams close to real-time and to deploy the speed layer of the lambda architecture in the fog. Table 5.3 gives an overview of the used containers. The user schrbr³ refers to my personal Docker repository. All containers have been built for ARM and x86-64 systems.

5.2.2 Container Orchestration

Managing and orchestrating any type of infrastructure can be very challenging and time consuming. Container orchestration is used to automate the installation and configuration process of containerized applications and services. Furthermore, container orchestration platforms try to ease management tasks such as scaling, monitoring, logging and debugging of containerized applications. As of today, there exist many container orchestration platforms from many different companies and open-source communities. Among others, the most well-known container orchestration platforms include Kubernetes, Docker Swarm (included in Docker Engine) and Apache Mesos/Marathon [138]. Choosing a container orchestration platform strongly depends on the use case, needs and required functionality. However, as this work does not focus on choosing the best container orchestration platform, Docker Swarm has been used for this proof-of-concept, as it is more lightweight and provides an easier setup process compared to Kubernetes or Mesos. A migration to Kubernetes – which also allows running Docker containers – could be done in the future.

³ https://hub.docker.com/u/schrbr, Last accessed: 2021-01-29

Table 5.3: Containers deployed on fog infrastructure

Container                         Version        Description
schrbr/kafka                      2.13-2.6.0 ⁴   Distributed, scalable and fault-tolerant message broker
zookeeper                         3.6.2          Manages Kafka nodes, topics and partitions
schrbr/flink-with-job-artifacts   latest ⁵       Distributed, scalable and fault-tolerant stream processing framework
cassandra                         3.11.8         Distributed, scalable and fault-tolerant data store
schrbr/dashboard                  latest         Self-developed application to query the speed and serving layers
schrbr/sensor                     latest         Self-developed application to simulate sensor devices

5.2.3 Deployment

As already mentioned, MockFog has been used to emulate and bootstrap fog computing infrastructure on AWS EC2 instances. Although MockFog allows running Docker containers on specified nodes, it does not support creating a cluster with container orchestration tools such as Docker Swarm. Therefore, further development operation tasks have been implemented using Ansible and are not based on MockFog. Docker Swarm has been used to create a cluster consisting of the nodes and services listed in Table 5.4. The nodes in Table 5.4 represent a setup to achieve a fault-tolerant system with a minimal number of nodes (see Sections 5.3 and 5.4).

⁴ ARM64/AMD64 build based on wurstmeister/kafka:2.13-2.6.0 (Kafka version: 2.6.0)
⁵ ARM64/AMD64 build based on flink:1.11.1-scala_2.12-java11 (Flink version: 1.11.1)


Table 5.4: Nodes of fog infrastructure

Node name   Label     Instance type   Service
stream1     stream    t4g.small       Flink (Job manager)
stream2     stream    t4g.small       Flink (Task manager)
stream3     stream    t4g.small       Flink (Task manager)
stream4     stream    t4g.small       Flink (Task manager)
broker1     broker    t4g.small       ZooKeeper
broker2     broker    t4g.small       ZooKeeper
broker3     broker    t4g.small       ZooKeeper
broker4     broker    t4g.small       Kafka
broker5     broker    t4g.small       Kafka
broker6     broker    t4g.small       Kafka
storage1    storage   t4g.small       Cassandra
storage2    storage   t4g.small       Cassandra
storage3    storage   t4g.small       Cassandra
monitor     monitor   t4g.small       Dashboard
sensor1     sensor    t4g.micro       Sensor (Data provider)
sensor2     sensor    t4g.micro       Sensor (Data provider)
sensor3     sensor    t4g.micro       Sensor (Data provider)
sensor4     sensor    t4g.micro       Sensor (Data provider)

To deploy the services on the cluster, the services have been defined in Docker Compose, a tool to run multi-container Docker applications. Afterwards, the Docker Compose configuration file has been deployed on the Docker Swarm cluster. Docker Swarm allows deploying containers globally or on specific nodes, based on the name of the node or a label with which a node has been tagged. Node tagging can be used to organize nodes into groups to efficiently manage, deploy and scale services within those groups.

5.3 Implementation of the Lambda Architecture

The proposed architecture in Chapter 4 operates on edge, fog as well as cloud infrastructure. Figure 5.1 gives an overview of how the lambda architecture has been implemented over the aforementioned infrastructure. In the following sections, the implementation tasks for the edge, fog and cloud layers are described.

Figure 5.1: Implementation of the lambda architecture for distributed stream processing in the fog

5.3.1 Edge Layer

In this context, the edge layer refers to devices with very limited computational capabilities, which operate on the outer edge of the network. Usually, such devices are sensors, which are connected to a network and send some sort of data periodically. As this proof-of-concept does not use real hardware, sensors have been simulated by software. In particular, a Python application has been developed, which simulates a sensor and sends data periodically to a message broker operating in the fog layer. Furthermore, the Python application has been packaged as a Docker container (schrbr/sensor) and has been deployed with Docker Swarm on AWS EC2 t4g.micro instances.
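As an illustration of the simulated sensor, the following minimal sketch periodically publishes a UUID-tagged JSON measurement to Kafka. It is written in Java for consistency with the other listings in this chapter (the thesis implements the sensor as a Python application); the broker address, topic name and field names are assumptions.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;
    import java.util.UUID;

    public class SensorSimulator {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker4:9092"); // hypothetical fog Kafka node
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    // Tag each measurement with a UUID so the round-trip time can be measured later.
                    String json = String.format(
                            "{\"id\":\"%s\",\"station\":\"s3\",\"co\":1.2,\"sentAt\":%d}",
                            UUID.randomUUID(), System.currentTimeMillis());
                    producer.send(new ProducerRecord<>("measurements", json));
                    Thread.sleep(30_000); // emission rate of evaluation case A (every 30 seconds)
                }
            }
        }
    }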

5.3.2 Fog Layer

Based on the definition of fog computing, fog nodes are located between the cloud and the edge of the network. This layer consists of more powerful devices compared to the edge layer, but still has limited computational capabilities compared to servers located in the cloud layer. Fog computing infrastructure hosts the speed layer of the lambda architecture to process data close to real-time.

Stream Processing

Apache Flink has been deployed over fog computing infrastructure to address the latency requirements of the problem statement. Therefore, a dockerized version of Flink's job and task managers (schrbr/flink-with-job-artifacts) has been deployed with Docker Swarm on AWS EC2 t4g.small instances. For this work, the topology consists of one job manager and three task managers. The number of task managers should be configured depending on the workload of the topology and the required reliability guarantees (e.g., if one task manager fails, the next one takes over).


While the task managers execute tasks, the job manager keeps track of the running tasks, schedules the next tasks and responds to completed tasks or execution errors [7]. In case the job or a task manager fails, it will be automatically restarted. Docker Swarm can be configured to restart the corresponding container on the same node or any other node within the cluster. With the usage of labels (see Table 5.4), containers can be placed and restarted on predefined (and optimally located) nodes. Additionally, Flink's checkpointing functionality can be used to make stateful functions and operators fault-tolerant. Furthermore, the job manager can be configured in high availability mode. In this configuration mode, one of a configurable number of standby job managers takes over leadership in case the leading job manager fails. The usage of this mode requires ZooKeeper or Kubernetes, and guarantees that there is no single point of failure [7].

As already mentioned, Apache Flink provides a rich set of pre-implemented operators such as selection, aggregation, sorting, splitting, joining, mapping, filtering and windowing. The topology consists of common stream processing operators to provide a meaningful evaluation [122, 123]. The used topology will be presented together with the evaluation scenario and data set in Chapter 6.
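Returning to the checkpointing functionality mentioned above, the following minimal Java sketch shows how a Flink job enables periodic checkpoints; the 10-second interval is an assumption and not a value taken from the thesis configuration.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSetup {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot all operator state every 10 seconds so that a restarted
            // task manager can resume from the last consistent checkpoint.
            env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

            // ... define the topology here, then call env.execute("speed-layer-topology");
        }
    }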

Synchronization of Batch and Real-time View

After the batch layer has recomputed the batch views and the serving layer has replaced the old batch views with the new ones, the speed layer has to discard its data. The batch layer notifies the speed layer in the fog to discard the data which is already included in the batch view. In particular, an application deployed in the fog listens for MQTT messages. These MQTT messages should not be mixed up with sensor data: it has already been discussed that an MQTT broker for sensor data has not been implemented within the scope of this thesis. The MQTT broker illustrated in Figure 5.1 (Eclipse Mosquitto) is used to synchronize the batch and the real-time view, as Kafka is not very suitable for this simple publish/subscribe use case. After a particular message has been consumed, the application prunes the data which is already included in the batch view from Apache Cassandra.

The easiest way to achieve this behavior and to discard the appropriate data is to store the most recent data redundantly in two separate real-time views and prune the real-time views alternatingly [15]. Figure 5.2 illustrates the synchronization of batch and real-time view. For example, after the first batch run, the data stored in the second real-time view is discarded. The first real-time view still consists of all the data of the batch view and all new incoming data, including the data which has been stored during the generation of the batch view. After the second batch run, the data stored in the first real-time view is discarded. Now, the second real-time view holds the appropriate data to complement the batch view.
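A minimal sketch of such a synchronization listener, using the Eclipse Paho client and the DataStax Java driver, is shown below; the broker address, topic name, payload format and table names are assumptions of this sketch and not taken from the actual implementation.

    import com.datastax.oss.driver.api.core.CqlSession;
    import org.eclipse.paho.client.mqttv3.MqttClient;
    import java.net.InetSocketAddress;

    public class BatchViewSyncListener {
        public static void main(String[] args) throws Exception {
            CqlSession cassandra = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("storage1", 9042))
                    .withLocalDatacenter("datacenter1")
                    .build();

            MqttClient mqtt = new MqttClient("tcp://mosquitto:1883", MqttClient.generateClientId());
            mqtt.connect();

            // The batch layer publishes a message after a new batch view has been stored
            // in HBase; the payload tells which of the two real-time views to prune.
            mqtt.subscribe("lambda/batchview-ready", (topic, message) -> {
                String view = new String(message.getPayload()); // "1" or "2" (assumption)
                cassandra.execute("TRUNCATE speedlayer.measurements_view_" + view);
            });
        }
    }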

Figure 5.2: Illustration of the synchronization of batch and real-time view (adapted from [15])

Dashboard

To demonstrate the functionality of the solution approach, a dashboard application has been developed to query the speed and serving layers. The Java application is based on Spring Boot and uses Spring Data for Apache Cassandra to connect to the storage solution of the speed layer. JDBC (Java Database Connectivity) is used to query data from the serving layer (HBase) via Hive. Furthermore, the dashboard hosts the application logic used to synchronize the batch and real-time view as described in Section 5.3.2.
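A minimal sketch of querying the serving layer via the Hive JDBC driver, as the dashboard does, could look as follows; the connection URL, credentials and the batch view table name are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ServingLayerQuery {
        public static void main(String[] args) throws Exception {
            // The Hive JDBC driver exposes the batch views (backed by HBase) via HiveQL.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://emr-master:10000/default", "hadoop", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT station_id, avg_co FROM batch_view LIMIT 10")) {
                while (rs.next()) {
                    System.out.printf("%s: %.3f%n", rs.getString(1), rs.getDouble(2));
                }
            }
        }
    }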

5.3.3 Cloud Layer

Cloud computing infrastructure hosts the batch and serving layers of the lambda architecture.

Stream Processing

Like the speed layer in the fog, the batch layer uses the same stream processing framework to process data streams in the cloud. Deploying an additional stream processing topology in the cloud has the benefit that data can be written into HDFS, as Flink can access HDFS directly. Moreover, it provides better flexibility (e.g., if data needs to be adapted) and reduces the workload of nodes in the fog, as the topology running in the fog is not responsible for storing the data in the cloud. Compared to stream processing in the fog, the topology in the cloud stores data into HDFS instead of Cassandra. Using the same processing framework in the fog and the cloud has the benefit that the code base for the stream processing topologies can be shared. However, using two separate stream processing topologies might result in synchronization issues between the batch view and the real-time view, if one topology lags behind the other longer than the time it takes for a new batch view to become available. Although such a scenario is very unlikely, it can be addressed using a more sophisticated synchronization process. Vanhove et al. propose to tag data as soon as it enters the system and to subsequently generate batch and speed views based on the tags [139]. When a new tag becomes active, the data marked with the old tag can be cleared without redundancy and information loss.

Batch Processing

Batch processing in the cloud is used to generate batch views for the serving layer of the lambda architecture. As already discussed in Chapter 4, Hive periodically queries data stored on HDFS and stores the results into HBase. To do so, Apache Oozie – a workflow scheduler for Apache Hadoop – has been used [140]. Oozie allows performing certain actions defined as a Directed Acyclic Graph (DAG). It supports several types of Hadoop jobs, among others MapReduce and Hive. The periodically executed Oozie workflow recreates the batch views and afterwards publishes an MQTT notification to reset the storage solution of the speed layer once a batch view has been successfully stored in HBase.
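A minimal sketch of the notification publish that follows a successful batch run, using the Eclipse Paho client; broker address, topic and payload are assumptions consistent with the listener sketched in Section 5.3.2.

    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttMessage;

    public class BatchViewNotifier {
        public static void main(String[] args) throws Exception {
            // Publish the "batch view ready" notification consumed by the speed layer.
            MqttClient client = new MqttClient("tcp://mosquitto:1883",
                    MqttClient.generateClientId());
            client.connect();
            client.publish("lambda/batchview-ready", new MqttMessage("1".getBytes()));
            client.disconnect();
            client.close();
        }
    }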

5.4 Implementation of Non-Functional Requirements

This section briefly describes how non-functional requirements such as reliability, scalability and heterogeneity have been addressed.

Reliability

Reliability and fault tolerance play a crucial role in providing a fully functional system, especially in a dynamic environment like the fog. Therefore, the solution approach has been designed to provide fault tolerance from the very beginning. All used software components and frameworks such as Apache Flink, Apache Cassandra, Apache Kafka and Apache ZooKeeper provide fault tolerance or can be deployed in a fault-tolerant way. Apache ZooKeeper has been deployed as a cluster of three nodes, which allows handling the failure of one ZooKeeper node. The same applies to Apache Kafka, where the partitions of a topic are replicated across the Kafka cluster using a replication factor of three.

Apache Cassandra features by design a masterless architecture with no single point of failure, where each Cassandra node is identical. Cassandra uses a gossip protocol to construct a cluster, make data consistent within the cluster and replicate data across the cluster. Depending on the cluster size, the configured replication factor, and the write and read consistency, Cassandra can still execute read and write operations if a cluster node fails. For example, in a 3-node cluster with a replication factor of 3 and a consistency level of 2 (for write and read operations), the cluster can tolerate one failed replica. Adding more nodes or decreasing the consistency level allows Cassandra to tolerate more node failures.

Furthermore, Docker Swarm allows increasing fault tolerance by adding more manager nodes to the swarm, which manage the swarm and store the swarm state. Manager nodes use the Raft consensus algorithm to manage the swarm state [141]. Based on the swarm size, there should be at least 3 manager nodes to maintain a quorum for the consensus algorithm.
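The Cassandra example above can also be expressed with the usual quorum condition: if \(R\) and \(W\) denote the read and write consistency levels and \(RF\) the replication factor, then

\[
R + W > RF
\]

guarantees that every read quorum overlaps with every write quorum. With \(RF = 3\) and \(R = W = 2\), the condition holds (\(2 + 2 > 3\)), so one of the three replicas may be unavailable while reads still see the latest successful write.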

Scalability

Besides fault tolerance, all used software components and frameworks provide scalability and can be scaled across fog computing infrastructure. In addition, Docker Swarm offers rich capabilities for scaling services over a large number of nodes. On top of that, new nodes can easily join the Docker Swarm cluster (e.g., if new fog nodes pop up).

Heterogeneity

The solution approach can be deployed on any device as long as the device is able to run Docker and provides moderate computational capabilities (at least 2 GiB of memory). Furthermore, the architecture does not require a specific microprocessor architecture and allows running services on x86/x64 as well as ARM-based nodes.

5.5 Limitations

This section discusses the limitations of the implemented solution approach.

Stream Processing

Apache Flink has not been optimized for the geographically distributed, heterogeneous resources of the fog. As already discussed in Section 4.2.1, a QoS scheduler could be implemented in the future. Janßen et al. propose a scheduler for Apache Flink to address this optimization [86]. The authors show that heuristic algorithms which consider QoS metrics such as bandwidth, latency and node capacities are significantly better at scheduling tasks across heterogeneous network topologies.

Message Broker

This solution approach uses Apache Kafka as a message broker, as Kafka is the first choice for stream processing and offers the best integration with third-party stream and batch processing frameworks. However, an MQTT broker is better suited to consume sensor data from a very large number of nodes [129]. For simplicity and to provide a certain level of abstraction, this solution approach has not been tested with an MQTT broker for sensor data. Another limitation of the current implementation with respect to the message broker is that the stream processing topology running in the cloud must be aware of the IP addresses of the Apache Kafka services running in the fog.

Computational Resources

In order to deploy this solution approach, fog nodes must be able to run Docker and should offer moderate processing capabilities, 2 or more GiB of memory and enough storage to store the most recent data (speed layer). The solution approach has also been tested on nodes with only 1 GiB of memory. Unfortunately, not all services were able to perform without problems. In particular, Flink's task manager service crashed several times due to the very limited amount of available memory. However, this strongly depends on the used stream processing topology. For very simple topologies, fog nodes with only 1 GiB of memory might be sufficient. Furthermore, Apache Cassandra might need more memory to perform optimally in a production environment. For this proof-of-concept, Cassandra worked very well on nodes with 2 GiB of memory. With the increasing number of more powerful IoT devices (e.g., the latest Raspberry Pi offers up to 8 GiB of memory), this should not be an issue in the near future.

Mobility

The solution approach does not consider moving data sources, data sinks or other nodes that are not at a static location.

Automatic Scaling

The solution approach does not scale resources automatically based on the current workload. This limitation also relates to the aforementioned mobility of nodes. For example, more resources are required if more sensor devices produce or receive data within a specific area [142].

Security and Privacy

The solution approach does not address security and privacy aspects like access control (authentication, authorization), data encryption, data breaches, or insecure APIs [143, 144].

CHAPTER 6

Evaluation

This chapter evaluates the solution approach introduced in Chapter 5. First, possible data sets for the evaluation are discussed. In the subsequent section, a motivational scenario describes how the data set together with the solution approach could be used in a practical example. Next, the design of the testbed for the evaluation is presented. Subsequently, the stream processing topology is described. After that, the benchmark design is described and the results are discussed. The chapter concludes with a short summary of the findings from the previous sections.

6.1 Data Sets

Finding a good and suitable data set for a specific use case is no easy task. In order to create a meaningful evaluation of the solution approach, the following data set candidates are briefly discussed.

6.1.1 Smart* Data Set

A well-known data set in academic research is the "Smart* Data Set for Sustainability"¹. The project consists of multiple data sets, among others the "UMass Smart* Home Data Set" and the "Smart* Microgrid Data Set" [145]. The initial smart home data set from 2013 consists of electrical data (e.g., circuits, energy meters, switches) collected from sensors of three real homes. The microgrid data set consists of electrical data over a single 24-hour period from 443 unique homes. In 2017 and 2019, the data sets have been extended by more homes and further electricity- and energy-related areas (e.g., solar panels). However, the data sets are not very well-suited to create a stream processing use case where high processing latency should be avoided.

¹ http://traces.cs.umass.edu/index.php/Smart/Smart, Last accessed: 2021-01-29


Table 6.1: Metadata air pollution in Seoul

Measurement   Unit           Good   Normal   Bad     Very bad
SO2           ppm            0.02   0.05     0.15    1.0
NO2           ppm            0.03   0.06     0.2     2.0
CO            ppm            2.0    9.0      15.0    50.0
O3            ppm            0.03   0.09     0.15    0.5
PM10          Microgram/m³   30.0   80.0     150.0   600.0
PM2.5         Microgram/m³   15.0   35.0     75.0    500.0

6.1.2 Air Pollution in Seoul

A very interesting data set has been published on Kaggle², which consists of air pollution measurements from the city of Seoul, South Korea. In this data set, data from 25 stations deployed around the city of Seoul has been collected. Each station has been equipped with sensors to measure sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3) and particulate matter (PM2.5, PM10). The whole data set consists of 647,511 measurements. Furthermore, the data set includes metadata, such as a classification of pollution measurements based on thresholds which should not be exceeded (see Table 6.1). The original data has been provided by the Seoul Metropolitan Government and has been published on Kaggle by the user "bappekim".

6.1.3 Feinstaub

A similar data set³ has been used in a data science challenge at the University of Rostock, Germany. Like the aforementioned air pollution data set, this data set provides sensor data about the air quality of cities in Germany. Unfortunately, the data set is neither well structured nor cleansed and would therefore require a lot of data preparation.

6.2 Motivational Scenario

For the evaluation, the "Air Pollution in Seoul" data set from Kaggle has been used, as the data set is very well structured and is suitable to create scenarios for low-latency stream processing. One possible scenario has already been partly described in the data set description. For example, several sensors are deployed across a city to measure the air quality (e.g., at public places or subway stations) based on the different measurement types and report their measurements to a stream processing topology located in the fog. The topology analyzes the measurements and informs third-party systems if the air quality exceeds a critical threshold. Third-party systems can be a variety of things, such as other IoT devices (e.g., signal horns), which are also deployed across the city to inform people to be careful in a specific area or to launch countermeasures. Furthermore, sensors equipped with a camera module can send image data along with the sensor data. As images are very sensitive information, the speed layer can be used to store images without the need to obfuscate sensitive information (e.g., before transferring the data to a public cloud provider without obfuscation due to privacy rights). The speed layer is reset once the data has been thoroughly obfuscated and is available in the cloud's serving layer.

Another scenario where an air pollution data set can be used is smart manufacturing. For example, imagine a smart factory which deals with toxic substances to produce goods or other substances (e.g., energy in nuclear power plants) and uses sensors to monitor, adjust and optimize the manufacturing process. The process requires low latencies to make fast decisions based on the sensor values. The speed layer located in the fog allows access to the most recent data. Furthermore, a fog-located speed layer can provide additional reliability and failure safety (e.g., in case of an Internet breakdown), as the fog might still be accessible.

² https://www.kaggle.com/bappekim/air-pollution-in-seoul, Last accessed: 2021-01-29
³ https://btw.informatik.uni-rostock.de/index.php/de/calls/data-science-challenge/feinstaubalarm.html, Last accessed: 2021-01-29

6.3 Testbed

6.3.1 Scenario: Fog

As already discussed in Section 5.1, MockFog [22] has been used to emulate fog computing infrastructures on AWS. MockFog allows simulating real network conditions of fog devices, such as connection latency/delay, available bandwidth, connection delay distribution, connection duplicate probability, connection loss probability, connection corrupt probability and connection reordering probability. Figure 6.1 gives an overview of the network conditions used for the fog scenario. The abbreviations in Figures 6.1 and 6.2 correspond to the services running on the nodes: f for Flink, z for ZooKeeper, k for Kafka, c for Cassandra, s for sensor and m for monitor (hosts the dashboard). Latency and connection loss probability between fog nodes have been configured according to Figure 6.1. Furthermore, all nodes are connected to a private network with a bandwidth of 30 Mbps. The applied delay between two nodes is the shortest path within the topology. For example, the delay between s1 and k2 is 9 ms (5 ms + 3 ms + 1 ms). However, from a technical point of view, the topology does not indicate how traffic is routed, as all machines are directly connected to the same private network.

Choosing meaningful network conditions for the fog is extremely difficult, as there are hardly any resources to refer to. Mahmud et al. suggest 3-5 ms as a maximum nodal communication delay within a fog cluster [146]. Badiger et al. use a latency of 0.5 ms and 1 ms for networks located in the fog [147]. Other network parameters such as packet loss are not mentioned at all. For this work, the latency between fog nodes has been chosen between 1 and 5 ms. Related services are deployed in close proximity (therefore, 1 ms), while sensors presumably have the highest latency to the fog (therefore, 5 ms). Furthermore, the network topology will be tested with and without a packet loss of 0.5% to the monitor node (m1).


Figure 6.1: Fog network conditions

6.3.2 Scenario: Cloud

In order to compare the results from the fog scenario, another scenario is needed. Typically, the lambda architecture is implemented in the cloud and sensor data is sent to the cloud, where the data is processed. Figure 6.2 gives an overview of the network topology used for the cloud scenario. Latency and connection loss probability between the cloud and sensor nodes have been configured according to Figure 6.2. Again, all nodes are connected to a private network. The bandwidth to the sensor nodes has been limited to 30 Mbps, while the other nodes are connected with 5 Gbps. The infrastructure has again been bootstrapped with MockFog, with the difference that network limitations have only been configured for the sensor nodes. As the technology stack for the fog part of the solution approach is technically the same stack that can be used for data processing in the cloud, the same node setup as in the fog scenario has been deployed on more powerful hardware running in the cloud. This way, the results are better comparable and the connection delay and possible packet loss can still be configured with MockFog. To host the services, compute-optimized c5.xlarge EC2 instances have been used, which provide 4 vCPUs, 8 GiB of memory and up to 10 Gbps network bandwidth. As a fault-tolerant stream processing setup requires more than three nodes and one node only runs one service, slightly less powerful EC2 instances have been chosen compared to the hardware configuration presented in Section 5.1.2.

Figure 6.2: Cloud network conditions

In order to specify a realistic connection latency to the cloud, several servers located in Frankfurt, Germany, and at the US east coast have been evaluated regarding their latency. It has been found that the average latency to servers located in Frankfurt is approximately 45 ms (+/- 10 ms), while the average latency to servers located on the US east coast is approximately 120 ms (+/- 10 ms). The tests were performed from Vienna, Austria, and included wireless and wired connected devices. At the time of testing, no significant difference was found between a wireless mobile Internet connection (3G/4G) and a wired Internet connection.

6.4 Topology

6.4.1 Overview

In the following section, each stream processing operator is briefly described. Figure 6.3 shows an overview of the topology and the implemented operators.


Figure 6.3: Topology overview

6.4.2 Operators

Parse Operator (O1)

The parse operator receives a JSON string and parses the string to a Data Transfer Object (DTO) used by downstream operators within the topology. This operator is the first operator in the Apache Flink topology.

Filter Operator (O2)

The topology deals with air quality sensor data from four different sensor stations. The filter operator filters one station.

Calculate Operator (O3)

This operator calculates the average over a one minute time span (referred to as window) for each station and each sensor measurement. The operator creates a new DTO with the average sensor values.

Monitor Operator (O4)

This operator monitors the CO measurement of each station and creates a new DTO if the CO measurement exceeds a predefined threshold (see Table 6.1). In order to test the topology and to generate objects more frequently, the current implementation outputs a new DTO if the CO measurement exceeds 1 ppm. In a real-world deployment, this threshold should be configured according to Table 6.1.

Stringify Operator (O5)

This operator creates a JSON representation of any passed input object.

Notify Operator/Sink (O6)

This operator publishes the JSON representation of the DTOs generated by the Calculate Operator and the Monitor Operator to Kafka.


Figure 6.4: Evaluation path for measuring the round-trip time consisting of the Calculate Operator (O3): P → K → O1 → O2 → O3 → O5 → O6 → C

Figure 6.5: Evaluation path for measuring the round-trip time consisting of the Monitor Operator (O4): P → K → O1 → O2 → O4 → O5 → O6 → C

Map Operator (O7)

This operator maps the transfer objects created by the Parse Operator to a Data Binding Object (DBO), which is used to store the data in the real-time views of the speed layer.

Store Operator/Sink (O8)

This operator stores the DBO created by the Map Operator to Cassandra.
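To make the operator descriptions above more concrete, the following Java sketch wires up a simplified version of the Monitor path (O1 → O2 → O4 → O5 → O6) with the Flink DataStream API. It is not the thesis code: the Kafka addresses, topic names and DTO fields are assumptions, and the Monitor Operator is reduced to a plain threshold filter for brevity.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import java.util.Properties;

    public class MonitorPathTopology {

        // Simplified DTO produced by the Parse Operator (field names are assumptions).
        public static class Measurement {
            public String id;       // UUID tag of the measurement
            public String station;  // sensor station, e.g. "s3"
            public double co;       // carbon monoxide in ppm
            public long sentAt;     // sending timestamp in ms
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker4:9092"); // hypothetical fog Kafka node
            props.setProperty("group.id", "speed-layer");

            ObjectMapper mapper = new ObjectMapper();

            DataStream<String> raw = env.addSource(
                    new FlinkKafkaConsumer<>("measurements", new SimpleStringSchema(), props));

            raw.map(json -> mapper.readValue(json, Measurement.class))   // O1: Parse
               .returns(Measurement.class)
               .filter(m -> "s3".equals(m.station))                      // O2: Filter one station
               .filter(m -> m.co > 1.0)                                  // O4: Monitor (test threshold of 1 ppm)
               .map(mapper::writeValueAsString)                          // O5: Stringify
               .returns(String.class)
               .addSink(new FlinkKafkaProducer<>(                        // O6: Notify
                       "notifications", new SimpleStringSchema(), props));

            env.execute("monitor-path");
        }
    }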

6.5 Benchmarks

The most interesting part of the solution approach to evaluate is the stream processing topology running in the fog and how well the topology compares to a traditional cloud approach. A prominent indication of how well the solution approach performs is the total round-trip time it takes from the moment a sensor sends a measurement until a consumer receives a notification based on the measurement. In order to measure the round-trip time, each sensor measurement has been tagged with a Universally Unique Identifier (UUID). Once a sensor sends the measurement, it reports the sending timestamp to the dashboard. The consumer, on the other hand, reports the consuming timestamp to the dashboard. To avoid clock synchronization issues, the sensor as well as the consumer have been deployed on the same node.

For this work, the round-trip times for two paths of the topology introduced above (see Figure 6.3) have been measured. Figures 6.4 and 6.5 give an overview of those two paths, where the first path includes the Calculate Operator (O3) and the second path includes the Monitor Operator (O4) of the Flink topology. The used abbreviations correspond to the abbreviations of the Flink topology, with P standing for producer, K for Kafka and C for consumer.

Measuring the round-trip time for the path with the Monitor Operator is very straightforward, as only one sensor measurement is involved. As soon as a measurement exceeds the configured threshold, the Monitor Operator passes a notification object to the downstream Notify Operator, which publishes the corresponding JSON (including the tag of the measurement) to a Kafka topic. A consumer consumes the notification object and reports the consuming timestamp to the dashboard, where the difference between the sending and consuming timestamps is calculated. The same procedure applies to the measurement of the round-trip time for the path consisting of the Calculate Operator, with the difference that measuring the round-trip time involves several sensor measurements over a one minute time span. As windows have been configured to not overlap, the round-trip time has been measured for each measurement individually.
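A minimal sketch of this round-trip time bookkeeping is shown below; the class and method names are assumptions and only illustrate the idea of pairing sending and consuming timestamps by UUID on the dashboard.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class RoundTripTimeTracker {

        // Sending timestamps reported by the sensors, keyed by the measurement UUID.
        private final Map<String, Long> sentAt = new ConcurrentHashMap<>();

        /** Called when a sensor reports that it sent the measurement with the given UUID. */
        public void reportSent(String uuid, long sendTimestampMs) {
            sentAt.put(uuid, sendTimestampMs);
        }

        /** Called when the consumer reports the consuming timestamp; returns the RTT in ms. */
        public long reportConsumed(String uuid, long consumeTimestampMs) {
            Long sent = sentAt.remove(uuid);
            if (sent == null) {
                throw new IllegalStateException("Unknown measurement tag: " + uuid);
            }
            return consumeTimestampMs - sent;
        }
    }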

6.5.1 Evaluation Cases

Besides the setup of the testbed and the used technology stack, the round-trip time is mostly influenced by the size of the data packets a sensor produces and the emission rate of the sensor. In order to increase the data packet size of a sensor measurement, images have been sent along with the sensor values. The evaluation uses four sensors, as indicated in the network topologies in Figures 6.1 and 6.2. Sensor s1 sends large data packets (∼1.25 MB), sensor s2 sends data packets of medium size (∼0.36 MB), and sensors s3 and s4 send small data packets (∼0.4 KB). Furthermore, sensors s1, s2, s3 and s4 represent different sensor stations of the "Air Pollution in Seoul" data set (see Section 6.1.2). As a result, the sensors do not share the same data records with the same measurements. The following enumeration and Table 6.2 describe the evaluation cases.

(A) Sensors produce records every 30 seconds.

    (1) The round-trip time is measured for large data packets of sensor s1 (∼1.25 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (2) The round-trip time is measured for data packets of medium size of sensor s2 (∼0.36 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (3) The round-trip time is measured for small data packets of sensor s3 (∼0.4 KB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

(B) Sensors produce records every second.

    (1) The round-trip time is measured for large data packets of sensor s1 (∼1.25 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (2) The round-trip time is measured for data packets of medium size of sensor s2 (∼0.36 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (3) The round-trip time is measured for small data packets of sensor s3 (∼0.4 KB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).


Table 6.2: Overview of evaluation cases

                      Emission rate / Network topology
Packet size           30 sec    1 sec    1 sec, no packet loss, 300 Mbps
Large (∼1.25 MB)      A.1       B.1      C.1
Medium (∼0.36 MB)     A.2       B.2      C.2
Small (∼0.4 KB)       A.3       B.3      C.3

(C) Sensors produce records every second and the network topology has been adapted. The packet loss shown in Figures 6.1 and 6.2 has been removed. Additionally, the bandwidth rate between fog nodes (fog topology) and between the cloud and sensor nodes (cloud topology) has been increased from 30 Mbps to 300 Mbps.

    (1) The round-trip time is measured for large data packets of sensor s1 (∼1.25 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (2) The round-trip time is measured for data packets of medium size of sensor s2 (∼0.36 MB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

    (3) The round-trip time is measured for small data packets of sensor s3 (∼0.4 KB) traversing the evaluation paths consisting of the Calculate Operator and the Monitor Operator (see Figures 6.4 and 6.5).

In the following sections, the results regarding the round-trip time of the evaluation cases described above are discussed and compared to each other. Section 6.5.2 compares cases between different sensor emission rates and network settings with the same data packet size for the Monitor Operator (e.g., A.1 vs. B.1 vs. C.1). Section 6.5.3 compares cases between different data packet sizes at the same sensor emission rate for the Monitor Operator (e.g., A.1 vs. A.2 vs. A.3). Last but not least, Section 6.5.4 discusses the results for the Calculate Operator. Each evaluation case has been benchmarked for about one hour. Depending on the sensor emission rate, 50 or 100 round-trip times will be presented in this evaluation. Furthermore, it should be mentioned that, due to the partially large differences in the measured round-trip times, the following figures do not share the same axis scale for better readability.

6.5.2 Discussion of Results for the Monitor Operator: Emission Rate and Network Topology

This section compares cases between different sensor emission rates and network settings with the same packet size for the Monitor Operator.


Table 6.3: Comparison of evaluation cases A.1, B.1 and C.1 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
A.1    Fog              2723        2674          467
A.1    Cloud (45 ms)    6516        6517          1356
A.1    Cloud (120 ms)   11 895      11 588        3173
B.1    Fog              2507        2507          342
B.1    Cloud (45 ms)    6431        6083          1768
B.1    Cloud (120 ms)   11 266      10 564        3269
C.1    Fog              1493        1488          71
C.1    Cloud (45 ms)    2823        2629          1105
C.1    Cloud (120 ms)   7233        6538          2900

Evaluation Cases: A.1 vs. B.1 vs. C.1

The results of the round-trip time of large data packets from one sensor station traversing the evaluation path presented in Figure 6.5 (path consisting of the Monitor Operator) show that the fog is able to outperform the cloud significantly (see Figure 6.6). Furthermore, it can be concluded that the sensor speed does not influence the results. In fact, the results with a lower sensor emission rate (A.1) are slightly worse than the results with a higher emission rate (B.1). Table 6.3 shows the mean, median and standard deviation in milliseconds. Only the adapted network topology (no packet loss and 300 Mbps bandwidth rate) shows improved round-trip times.

Looking at the results of the cloud (with 45 ms latency) without packet loss and with increased bandwidth compared to the fog with packet loss (C.1 Cloud (45 ms) vs. B.1 Fog), it can be observed that the cloud performs very similarly to the fog, or rather, that the fog performs only slightly better than the cloud. This shows that potential packet loss drastically decreases the performance of stream processing in the fog. Comparing the cloud (with 120 ms latency) without packet loss and increased bandwidth with the fog (C.1 Cloud (120 ms) vs. B.1 Fog), the results still show that the fog is superior.

Evaluation Cases: A.2 vs. B.2 vs. C.2

This comparison is similar to the first comparison (A.1 vs. B.1 vs. C.1), with the difference that this time the sensor (sensor s2 instead of sensor s1) sends medium-sized packets (∼0.36 MB instead of ∼1.25 MB). Figure 6.7 visualizes the results for the different sensor emission rates (A.2, B.2) and the adapted network topology (C.2). The fog is able to outperform the cloud (A.2, B.2 and C.2) also for medium-sized data packets. Table 6.4 shows the mean, median and standard deviation in milliseconds.


Figure 6.6: Comparison of evaluation cases A.1, B.1 and C.1 for the Monitor Operator

Due to the low emission rate, A.2 shows only the round-trip times of 30 requests, as only 30 data records of this sensor (sensor s2) exceeded the configured threshold of the Monitor Operator within a benchmark time of more than one hour. For this sensor, comparing the cloud (with 45 ms latency) without packet loss and with increased bandwidth to the fog with packet loss (C.2 Cloud (45 ms) vs. B.2 Fog), it can be observed that the fog outperforms the cloud.

Evaluation Cases: A.3 vs. B.3 vs. C.3

This comparison is like the first two comparisons, with the difference that this time the sensor (sensor s3) sends small-sized packets (∼0.4 KB). Figure 6.8 visualizes the results for the different sensor emission rates (A.3, B.3) and the adapted network topology (C.3).


Figure 6.7: Comparison of evaluation cases A.2, B.2 and C.2 for the Monitor Operator

The fog is able to slightly outperform the cloud (A.3, B.3) also for small-sized data packets (see Table 6.5). Removing the packet loss and increasing the bandwidth shows that the fog performs similarly to the cloud (C.3) if the sensor generates data packets of small size.

Looking at the results of the cloud (with 45 ms latency) without packet loss and increased bandwidth compared to the fog with packet loss (C.3 Cloud (45 ms) vs. B.3 Fog), it can be observed that the cloud outperforms the fog. This shows again that potential packet loss drastically decreases the performance of stream processing in the fog, especially if sensors send small-sized packets. Comparing the cloud (with 120 ms latency) without packet loss and increased bandwidth with the fog (C.3 Cloud (120 ms) vs. B.3 Fog), the results still show that the fog performs better.


Table 6.4: Comparison of evaluation cases A.2, B.2 and C.2 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
A.2    Fog              1041        1008          249
A.2    Cloud (45 ms)    2463        2148          740
A.2    Cloud (120 ms)   4478        4123          1524
B.2    Fog              1086        887           465
B.2    Cloud (45 ms)    3452        2484          2409
B.2    Cloud (120 ms)   4767        4145          2210
C.2    Fog              593         549           143
C.2    Cloud (45 ms)    2786        2089          1691
C.2    Cloud (120 ms)   3084        2320          1742

Table 6.5: Comparison of evaluation cases A.3, B.3 and C.3 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
A.3    Fog              183         40            419
A.3    Cloud (45 ms)    209         103           473
A.3    Cloud (120 ms)   678         286           1117
B.3    Fog              336         122           417
B.3    Cloud (45 ms)    342         102           722
B.3    Cloud (120 ms)   1048        296           1532
C.3    Fog              162         46            193
C.3    Cloud (45 ms)    135         101           112
C.3    Cloud (120 ms)   463         269           354

6.5.3 Discussion of Results for the Monitor Operator: Data Packet Size

This section compares cases between different data packet sizes at the same sensor emission rate for the Monitor Operator. Although this comparison is not 100% valid, as the sensors within one emission rate/network setting category (A, B, C) do not share the same data records with the same measurements, it gives a nice indication of how the round-trip times are influenced by the size of the data packets.

Figure 6.8: Comparison of evaluation cases A.3, B.3 and C.3 for the Monitor Operator

Evaluation Cases: A.1 vs. A.2 vs. A.3

This comparison looks at the difference between the round-trip times of large-, medium- and small-sized sensor packets (∼1.25 MB vs. ∼0.36 MB vs. ∼0.4 KB) from different sensor stations traversing the evaluation path presented in Figure 6.5 at the same sensor emission rate (every 30 seconds).

Figure 6.9 and Table 6.6 show the results of the comparison. It can be observed that the fog outperforms the cloud for small, medium and large data packets. The biggest difference can be observed for larger data packets. Furthermore, it can be observed that sending small data packets significantly improves the round-trip time for the fog as well as for both cloud scenarios. The cloud scenarios benefit the most from small data packets. For large data packets (A.1), the round-trip times for the fog are approximately 139% faster compared to the cloud with 45 ms latency, while for small data packets (A.3), the round-trip times for the fog are only approximately 14% faster compared to the cloud with 45 ms latency.


Figure 6.9: Comparison of evaluation cases A.1, A.2 and A.3 for the Monitor Operator

Evaluation Cases: B.1 vs. B.2 vs. B.3

This comparison looks at the difference between the round-trip times of a sensor sending large-, medium- and small-sized data packets at the same sensor emission rate (every second). Figure 6.10 and Table 6.7 show the results of the comparison. The results are very similar to the comparison before (A.1 vs. A.2 vs. A.3), as only the sensor emission rate changed from every 30 seconds to every second. The increased emission rate does not change the results, which show that the fog scenario once more performs better than the cloud scenarios. Like before, the cloud scenarios benefit from small data packets.


Table 6.6: Comparison of evaluation cases A.1, A.2 and A.3 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
A.1    Fog              2723        2674          467
A.1    Cloud (45 ms)    6516        6517          1356
A.1    Cloud (120 ms)   11 895      11 588        3173
A.2    Fog              1041        1008          249
A.2    Cloud (45 ms)    2463        2148          740
A.2    Cloud (120 ms)   4478        4123          1524
A.3    Fog              183         40            419
A.3    Cloud (45 ms)    209         103           473
A.3    Cloud (120 ms)   678         286           1117

Table 6.7: Comparison of evaluation cases B.1, B.2 and B.3 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
B.1    Fog              2507        2507          342
B.1    Cloud (45 ms)    6431        6083          1768
B.1    Cloud (120 ms)   11 266      10 564        3269
B.2    Fog              1086        887           465
B.2    Cloud (45 ms)    3452        2484          2409
B.2    Cloud (120 ms)   4767        4145          2210
B.3    Fog              336         122           417
B.3    Cloud (45 ms)    342         102           722
B.3    Cloud (120 ms)   1048        296           1532

Evaluation Cases: C.1 vs. C.2 vs. C.3

The last comparison is identical with the comparison before (B.1 vs. B.2 vs. B.3) regarding the sensor emission rate. However, this time the packet loss has been removed and the bandwidth rate has been increased. Figure 6.11 visualizes the results for differently sized sensor packets. The fog is able to outperform the cloud for large and medium-sized packets (C.1 and C.2). Sending data packets of small size significantly improves the performance of the cloud (C.3). In this case, the results show very similar round-trip times between the fog scenario and the cloud scenario with 45 ms latency; compared to the cloud scenario with 120 ms latency, the fog still provides better round-trip times. Once again, it can be observed that sending smaller data packets significantly improves the round-trip time for the fog as well as for the cloud scenarios.


Figure 6.10: Comparison of evaluation cases B.1, B.2 and B.3 for the Monitor Operator

Figure 6.11: Comparison of evaluation cases C.1, C.2 and C.3 for the Monitor Operator

6.5.4 Discussion of Results for the Calculate Operator

This section compares the results of the evaluation cases for the Calculate Operator. Although all cases have been benchmarked, this section only presents one case regarding the different sensor emission rates/network settings (A.2 vs. B.2 vs. C.2) and one case regarding the different data packet sizes (C.1 vs. C.2 vs. C.3). The reason why only one case is discussed is that the results are very similar, as the round-trip time is mostly influenced by the window size (time span) of the Calculate Operator.

The results show the round-trip time involving the Calculate Operator, which calculates the average of the sensor values over a one minute time span for each station and each sensor measurement. As windows have been configured to not overlap and the round-trip time has been measured for each measurement individually, the results seem somewhat odd at first sight and must be treated with some reservation. As has already been discussed earlier, the measurement of the round-trip time involves several sensor measurements over a one minute time span. Figure 6.12 (especially B.2, C.1, C.2 and C.3) shows the involved sensor measurements of each window very nicely, as the round-trip time decreases with each measurement that enters the window. After a one minute window closes and a new window opens, the round-trip time increases again to the window size (here, 1 minute or 6 × 10^4 ms).

Table 6.9 shows the mean, median and standard deviation in milliseconds. Unfortunately, these results are not very meaningful, as the round-trip time is mostly influenced by when the measurements of the sensor enter the one minute window of the Calculate Operator. As a result, there is hardly any difference between the round-trip times for different sensor emission rates, network settings or data packet sizes based on the first 100 measurements. For the same reason, there is also no difference regarding the round-trip time between the fog and cloud scenarios.

Table 6.8: Comparison of evaluation cases C.1, C.2 and C.3 for the Monitor Operator: Mean, Median, St. Dev.

Case   Scenario         Mean (ms)   Median (ms)   St. Dev. (ms)
C.1    Fog              1493        1488          71
C.1    Cloud (45 ms)    2823        2629          1105
C.1    Cloud (120 ms)   7233        6538          2900
C.2    Fog              593         549           143
C.2    Cloud (45 ms)    2786        2089          1691
C.2    Cloud (120 ms)   3084        2320          1742
C.3    Fog              162         46            193
C.3    Cloud (45 ms)    135         101           112
C.3    Cloud (120 ms)   463         269           354

Evaluation Cases: A.2 vs. B.2 vs. C.2

Figure 6.12 shows the round-trip time of medium-sized sensor packets with different emission rates/network settings (A.2, B.2 and C.2) from one sensor (sensor s2) traversing the evaluation path presented in Figure 6.4. It can be observed that the faster the emission rate and the better the network topology (lower latency, no packet loss), the more data records within a one minute time span are consumed and considered by the Calculate Operator of the Flink topology.

Evaluation Cases: C.1 vs. C.2 vs. C.3

Figure 6.12 also shows the round-trip time of large-, medium- and small-sized sensor packets (C.1, C.2 and C.3) from different sensor stations traversing the evaluation path presented in Figure 6.4. Here, it can be observed that the smaller the data packet size, the more data records within a one minute time span are consumed and considered by the Calculate Operator of the Flink topology.

6.6Summary

The evaluation showedthat using distributed stream processinginthe fog can be avery promising alternativecompared to traditional dataprocessing in the cloud. Overall, the results showedadecrease in the round-trip times. In Section 6.5.2, it can be observedthat increasing the emissionrateofsensorsdoesnot really influenceround-trip times. Thecauseofthis might be that thetopology is not working to its full capacity, with onlyfour sensorssending measurements everysecond. As theemission rateisalready relatively high, the solutionapproachcould be tested

83 6. Evaluation

Table 6.9: Comparison of evaluation casesA.2, B.2,C.1,C.2 and C.3for the Calculate Operator:Mean, Median, St.Dev.

Mean (ms) Median (ms) St. Dev. (ms) Fog 31 131 30 956 17 105 Cloud (45 ms) 29 657 30 689 17 587 A.2 Cloud (120 ms) 32 638 32 695 17 387 Fog 29 700 28 125 17 058 2 Cloud (45 ms) 33 252 34 328 17 898 B. Cloud (120 ms) 32 611 32 717 17 898 Fog 32 699 32 699 17 474 1 Cloud (45 ms) 32 197 30 555 17 263 C. Cloud (120 ms) 32 648 32 672 17 866 Fog 30 039 29 416 17 218 2 Cloud (45 ms) 32 738 33 948 16 381 C. Cloud (120 ms) 31 285 30 876 17 544 Fog 30 742 33 283 18 642 3 Cloud (45 ms) 30 122 30 139 17 373 C. Cloud (120 ms) 30 125 30 604 17 211

again with a larger number of sensors in the future. To do this, using simulation or a real-world testbed might be better suited, as emulating a large number of sensors is cost-intensive.

The adapted network topology with no packet loss and an increased bandwidth of 300 Mbps improved the round-trip times significantly. The cloud with 45 ms latency and the adapted network topology (C.1), and the fog with packet loss (A.1), showed similar results for data packets of large size; for data packets of small size, the cloud (C.3) performed slightly better than the fog (A.3, B.3). It can be concluded that potential packet loss drastically decreases the performance of stream processing in the fog. The reason for this is that TCP detects packet loss and retransmits lost packets to make sure messages are received at their destination. In case not all data packets must be received (e.g., from a sensor to a message broker), it might be an option to consider UDP, as it does not retransmit lost data packets.

In Section 6.5.3, it can be observed that sending data packets of small size significantly reduces the round-trip time for the fog as well as for the cloud scenarios; the cloud scenarios benefit the most from small data packets.

There are many other possible parameterizations for further tests, but the results already show that using fog computing is a promising approach for latency-sensitive applications, especially if the latency to the cloud is over 50 ms or the data packet size is quite large.
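To make the UDP remark concrete, the following minimal Java sketch (host name, port, and payload are hypothetical) sends a single sensor measurement as a fire-and-forget datagram. Unlike TCP, a lost datagram is simply gone and nothing is retransmitted, which avoids the retransmission delays observed under packet loss but is only acceptable if individual measurements may be dropped:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class UdpSensorSender {
        public static void main(String[] args) throws Exception {
            // Hypothetical gateway/broker address and port on the fog node.
            InetAddress gateway = InetAddress.getByName("fog-gateway.local");
            int port = 9999;

            try (DatagramSocket socket = new DatagramSocket()) {
                // One fire-and-forget measurement; if the datagram is lost on the way,
                // it is not retransmitted (in contrast to TCP).
                byte[] payload = "{\"sensorId\":\"s2\",\"pm10\":41.0}".getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, gateway, port));
            }
        }
    }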


[Figure 6.12 consists of five panels (A.2, B.2, C.1, C.2 and C.3), each plotting the round-trip time RTT (ms, scale ·10⁴) over the first 100 requests for the Fog, Cloud (45 ms) and Cloud (120 ms) scenarios.]

Figure 6.12: Comparison of evaluation cases A.2, B.2, C.1, C.2 and C.3 for the Calculate Operator


CHAPTER 7

Conclusion and Future Work

The final chapter summarizes the essential findings of this work. First, a short discussion outlines the basic ideas for implementing the lambda architecture for distributed stream processing in the fog. Subsequently, possible future work is discussed.

7.1 Discussion

With the increase of data velocity and data volume, centralized solutions are no longer sufficient for large-scale IoT applications. As of today, there exist many approaches that propose fog computing, and especially data analytics at the edge of the network, as a solution to address current challenges and issues of IoT-based applications. For analyzing new data streams and historic data efficiently, the lambda architecture design pattern has been introduced. Chapter 2 describes current state-of-the-art distributed stream processing frameworks as well as the lambda architecture design pattern. The main idea of the lambda architecture is to build big data systems as a stack of three layers: speed layer, serving layer and batch layer. The interplay of these three layers makes it possible to process massive volumes of historical batch data while simultaneously using stream processing for real-time analysis of continuous data streams. The architecture allows to optimize latency, throughput, and fault tolerance of long-running queries by providing accurate data from the batch layer and recent data from the speed layer. Chapter 3 discusses current research regarding distributed stream processing frameworks and the lambda architecture in the fog.

Chapter 4 and Chapter 5 propose a solution approach for implementing the lambda architecture for distributed stream processing in the fog. To this end, the basic idea behind the lambda architecture and the usage of stream processing in the fog have been combined. The main goal of the architecture is to process data in close proximity to the data sources


to avoid high latency and bandwidth usage to the cloud. In the proposed solution approach, the speed layer of the lambda architecture, together with stream processing capabilities, has been implemented at the edge of the network. The batch and serving layers have been implemented on cloud-located infrastructure, as only the cloud provides the capabilities to process and store big piles of data.
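To make the interplay of the layers concrete, the following minimal Java sketch shows the basic idea of answering a query by combining a precomputed batch view with the real-time view maintained by the speed layer. All names and the merge function are hypothetical, and plain maps stand in for the HBase and Cassandra stores of the solution approach:

    import java.util.HashMap;
    import java.util.Map;

    public class LambdaQuerySketch {

        // Hypothetical batch view (materialized by the batch layer, e.g. in HBase):
        // values precomputed over the master data set up to the last batch run.
        static Map<String, Double> batchView = new HashMap<>();

        // Hypothetical real-time view (kept by the speed layer, e.g. in Cassandra):
        // values computed from measurements that arrived after the last batch run.
        static Map<String, Double> realTimeView = new HashMap<>();

        // A query is answered from the batch view and complemented with the real-time view.
        // The merge function is application-specific; a plain average is used only for illustration.
        static double query(String sensorId) {
            double batch = batchView.getOrDefault(sensorId, 0.0);
            double recent = realTimeView.getOrDefault(sensorId, 0.0);
            return (batch + recent) / 2;
        }

        public static void main(String[] args) {
            batchView.put("s2", 42.0);     // from historical data
            realTimeView.put("s2", 44.5);  // from the most recent stream
            System.out.println("merged view for s2: " + query("s2"));
        }
    }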

For data analytics in the cloud, the Apache Hadoop stack has been used. The master data set has been stored on HDFS (batch layer of the lambda architecture), and MapReduce in combination with Hive has been used to create batch views. MapReduce has been chosen as it is already included in the Apache Hadoop stack. However, using Apache Flink or Apache Spark for batch analytics is just as valid as using plain old MapReduce. Batch views have been stored in HBase (serving layer of the lambda architecture), which is a great fit if HDFS is used.
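As an illustration of the kind of job the batch layer runs over the master data set, the following Java sketch submits a HiveQL statement over JDBC, which Hive then compiles into MapReduce jobs. The HiveServer2 endpoint, table names, and query are hypothetical and not taken from the actual implementation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BatchViewJobSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint of the cloud cluster.
            String url = "jdbc:hive2://cloud-master:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Recompute a (hypothetical) batch view over the master data set on HDFS;
                // Hive translates the statement into MapReduce jobs behind the scenes.
                stmt.execute(
                    "INSERT OVERWRITE TABLE batch_view_hourly_avg " +
                    "SELECT sensor_id, hour, AVG(pm10) " +
                    "FROM master_dataset GROUP BY sensor_id, hour");
            }
        }
    }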

For implementing the speed layer in the fog, Apache Flink is used to process, analyze, and store incoming data streams in real time. For this work, sensor devices at the edge of the network communicate directly with Apache Kafka. As the storage solution for the speed layer in the fog, Apache Cassandra has been used. This setup allows latency-sensitive applications to avoid high latencies to the cloud and to store the most recent data close to the data sources. Furthermore, Apache Flink has also been used in the cloud to provide more flexibility if data needs to be processed before being stored on HDFS.
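A minimal sketch of such a speed-layer topology is shown below; the broker address, topic name, and consumer group are hypothetical, and a print sink merely stands in for the Cassandra sink (which could be attached, e.g., via Flink's Cassandra connector):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class SpeedLayerSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical Kafka broker and topic on a fog node.
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "fog-node-1:9092");
            props.setProperty("group.id", "speed-layer");

            DataStream<String> measurements = env.addSource(
                    new FlinkKafkaConsumer<>("sensor-measurements", new SimpleStringSchema(), props));

            // Real-time processing of the stream would happen here; the results would then
            // be written to Apache Cassandra instead of being printed.
            measurements.print();

            env.execute("speed-layer-sketch");
        }
    }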

The whole system has been designed to be as reliable, fault-tolerant, and scalable as possible. All used software components and frameworks, such as Apache Flink, Apache Cassandra, Apache Kafka, and Apache ZooKeeper, provide fault tolerance and/or can be deployed in a fault-tolerant way.

In order to efficiently deploy services across fog computing infrastructure, the used software components have been packaged in Docker containers. For this proof-of-concept, Ansible together with Docker Swarm has been used to manage, orchestrate, and deploy containers on nodes in the fog. To address the heterogeneity of the fog, all containers have been built for ARM and x86-64 systems.

This work proposes an architectural design for the implementation of the lambda architecture for distributed stream processing over fog computing infrastructure. The solution approach shows one possible implementation of the proposed architecture. Although all frameworks, data stores, and tools have been chosen for a particular reason (see Chapter 4), the solution approach does not imply the mandatory use of the presented software components. Depending on the use case and further requirements, alternatives for each layer of the lambda architecture have been discussed in Chapter 4.

The evaluation of the solution approach shows that using distributed stream processing in the fog can be a very promising alternative to traditional data processing in the cloud. Overall, the results show a decrease in the round-trip times, especially if the latency to the cloud is over 50 ms or the data packet size is quite large.


7.2 Future Work

This work proposes an architectural design for the implementation of the lambda architecture for distributed stream processing in the fog and subsequently presents a solution approach of the proposed architecture. Although the results already showed a decrease in the round-trip times, the solution approach (especially the speed layer) could be further optimized for fog computing infrastructure.

First of all, the stream processing topology in the fog could be improved with a QoS-aware scheduler for Apache Flink to address the geographically distributed, heterogeneous resources of the fog. Besides improving Apache Flink for the fog, the solution approach could also be tested with other stream processing frameworks such as Apache Spark in combination with the new experimental streaming execution mode "Continuous processing"¹ introduced in Spark 2.3.

For simplicity and to provide a certain level of abstraction, the solution approach has not been tested with an MQTT broker for sensor data. Although Apache Kafka is the first choice for stream processing and offers the best integration with third-party stream and batch processing frameworks, an MQTT broker is better suited to consume sensor data from a large number of nodes. In the future, an MQTT broker like RabbitMQ or HiveMQ could be tested in combination with Apache Kafka or as a replacement for Apache Kafka; a minimal sketch of such a bridge is shown below. In addition to the usage of an MQTT message broker, the whole solution approach should be tested again with a very large number of sensor devices. As the evaluation in Chapter 6 already stated, the solution approach was not working to its full capacity with only four sensors sending measurements every second. In the future, simulation or a real-world testbed could be used to evaluate the computational boundaries and performance limits of the solution approach.

The solution approach suggests the use of Ansible, containers, and a container orchestration platform to efficiently deploy services across fog computing infrastructure. For this proof-of-concept, Docker Swarm has been used, as it is lightweight and provides an easy setup process. In the future, other container orchestration platforms could be tested, such as Kubernetes or more lightweight versions of Kubernetes (e.g., K3s² or KubeEdge³). Furthermore, the containers themselves could be improved to be as slim, secure, and performant as possible (e.g., optimizing the base image of containers).

Regarding the lambda architecture, a more sophisticated synchronization process between the batch and speed layer could be implemented to avoid redundancy (and information loss) between the batch views and the real-time views. To achieve this, the real-time view must be strictly aware of which data records are included in the batch view and vice versa. Such an approach has already been briefly described in Section 5.3.

¹ https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing, Last accessed: 2021-01-29
² https://k3s.io/, Last accessed: 2021-01-29
³ https://kubeedge.io/, Last accessed: 2021-01-29
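The MQTT-to-Kafka bridge mentioned above could, for instance, look like the following minimal Java sketch, assuming the Eclipse Paho MQTT client and the Kafka producer API; broker addresses and topic names are hypothetical, and error handling is omitted:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
    import org.eclipse.paho.client.mqttv3.MqttCallback;
    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttMessage;

    public class MqttKafkaBridgeSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Kafka broker on a fog node.
            Properties props = new Properties();
            props.put("bootstrap.servers", "fog-node-1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // Hypothetical MQTT broker consuming the sensor data.
            MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "mqtt-kafka-bridge");
            mqtt.setCallback(new MqttCallback() {
                @Override
                public void messageArrived(String topic, MqttMessage message) {
                    // Forward every MQTT message to the Kafka topic consumed by the Flink topology.
                    producer.send(new ProducerRecord<>("sensor-measurements", topic,
                            new String(message.getPayload())));
                }
                @Override
                public void connectionLost(Throwable cause) { /* reconnect logic omitted */ }
                @Override
                public void deliveryComplete(IMqttDeliveryToken token) { /* not used for subscriptions */ }
            });
            mqtt.connect();
            mqtt.subscribe("sensors/#");
        }
    }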


The storage solution of the speed layer could be improved for the fog by implementing a fog-aware placement strategy for replicas and a context-sensitive consistency model (e.g., depending on the location of IoT devices, not all devices need the same consistency guarantees at the same time). Moreover, the speed layer could be tested with other distributed and fault-tolerant storage solutions such as MongoDB, CouchDB, or Redis.

Last but not least, it should be emphasized that with the emergence of the latest mobile communication standard (5G) and the ever-increasing number of more powerful IoT devices (e.g., the latest Raspberry Pi is available with 8 GiB of memory), it remains to be seen how the fog will evolve in the future.

List of Figures

2.1 Prominent protocols for the IoT (adapted from [32]) ...... 10
2.2 Fog computing environment (adapted from [47]) ...... 12
2.3 Apache Storm topology (adapted from [6]) ...... 17
2.4 Apache Storm cluster (adapted from [75]) ...... 18
2.5 Apache Heron topology architecture (adapted from [8]) ...... 19
2.6 Apache Spark Streaming (adapted from [9]) ...... 19
2.7 Lambda architecture (adapted from [15]) ...... 21

3.1 Extended Apache Storm architecture (adapted from [10]) ...... 29
3.2 SpanEdge architecture (adapted from [89]) ...... 30
3.3 GeeLytics architecture (adapted from [90]) ...... 31
3.4 NES architecture (adapted from [92]) ...... 33
3.5 Lambda architecture for maintenance analysis (adapted from [100]) ...... 36

4.1 Architecture of the solution approach ...... 50

5.1 Implementation of the lambda architecture for distributed stream processing in the fog ...... 59
5.2 Illustration of the synchronization of batch and real-time view (adapted from [15]) ...... 61

6.1 Fog network conditions ...... 68
6.2 Cloud network conditions ...... 69
6.3 Topology overview ...... 70
6.4 Evaluation path for measuring the round-trip time consisting of the Calculate Operator (O3) ...... 71
6.5 Evaluation path for measuring the round-trip time consisting of the Monitor Operator (O4) ...... 71
6.6 Comparison of evaluation cases A.1, B.1 and C.1 for the Monitor Operator ...... 75
6.7 Comparison of evaluation cases A.2, B.2 and C.2 for the Monitor Operator ...... 76
6.8 Comparison of evaluation cases A.3, B.3 and C.3 for the Monitor Operator ...... 78
6.9 Comparison of evaluation cases A.1, A.2 and A.3 for the Monitor Operator ...... 79
6.10 Comparison of evaluation cases B.1, B.2 and B.3 for the Monitor Operator ...... 81
6.11 Comparison of evaluation cases C.1, C.2 and C.3 for the Monitor Operator ...... 82
6.12 Comparison of evaluation cases A.2, B.2, C.1, C.2 and C.3 for the Calculate Operator ...... 85

List of Tables

3.1 Objectives to evaluate distributed stream processing approaches for fog computing ...... 27
3.2 Evaluation of distributed stream processing approaches for fog computing ...... 34

5.1 Hardware configuration of emulated fog infrastructure ...... 54
5.2 Software configuration of cloud infrastructure ...... 56
5.3 Containers deployed on fog infrastructure ...... 57
5.4 Nodes of fog infrastructure ...... 58

6.1 Metadata air pollution in Seoul ...... 66
6.2 Overview evaluation cases ...... 73
6.3 Comparison of evaluation cases A.1, B.1 and C.1 for the Monitor Operator: Mean, Median, St. Dev. ...... 74
6.4 Comparison of evaluation cases A.2, B.2 and C.2 for the Monitor Operator: Mean, Median, St. Dev. ...... 77
6.5 Comparison of evaluation cases A.3, B.3 and C.3 for the Monitor Operator: Mean, Median, St. Dev. ...... 77
6.6 Comparison of evaluation cases A.1, A.2 and A.3 for the Monitor Operator: Mean, Median, St. Dev. ...... 80
6.7 Comparison of evaluation cases B.1, B.2 and B.3 for the Monitor Operator: Mean, Median, St. Dev. ...... 80
6.8 Comparison of evaluation cases C.1, C.2 and C.3 for the Monitor Operator: Mean, Median, St. Dev. ...... 83
6.9 Comparison of evaluation cases A.2, B.2, C.1, C.2 and C.3 for the Calculate Operator: Mean, Median, St. Dev. ...... 84


Acronyms

6LoWPAN IPv6 over Low-Power Wireless Personal Area Networks. 8, 9
AMQP Advanced Message Queuing Protocol. 9
AWS Amazon Web Services. 35, 53–55, 57, 59, 67
CoAP Constrained Application Protocol. 9, 35
DAG Directed Acyclic Graph. 62
DBO Data Binding Object. 71
DTO Data Transfer Object. 70
EMR Elastic MapReduce. 55
EPC Electronic Product Code. 8
ETSI European Telecommunications Standards Institute. 9
HDFS Hadoop Distributed File System. 16, 20, 47–50, 61, 88
IEEE Institute of Electrical and Electronics Engineers. 9
IETF Internet Engineering Task Force. 9
IoT Internet of Things. xi, xiii, 1, 2, 5, 7–15, 25–38, 41–43, 46, 53–55, 63, 67, 87, 90
IoV Internet of Vehicles. 36
ITS Intelligent Transportation Systems. 36
ITU International Telecommunication Union. 7, 8
JDBC Java Database Connectivity. 61
LoRaWan Long Range Wide Area Network. 9, 13
LR-WPAN Low-Rate Wireless Personal Area Network. 9
M2M Machine-to-Machine. 10, 46
MEC Mobile Edge Computing. 14
MQTT Message Queue Telemetry Transport. 9, 32, 46, 47, 51, 60, 62, 63, 89
NES NebulaStream. 32–34
NFV Network Function Virtualisation. 14
NIST National Institute of Standards and Technology. 11
OWL Web Ontology Language. 10
QoS Quality of Service. 28, 33, 45, 63, 89
RDD Resilient Distributed Dataset. 19
RDF Resource Description Framework. 10
REST Representational State Transfer. 9
RFID Radio-Frequency Identification. 7–9, 11
SDN Software Defined Network. 14
SoC System-on-a-Chip. 14
SPADE The System S Declarative Stream Processing Engine. 16
UUID Universally Unique Identifier. 71
V2I Vehicle-to-Infrastructure. 11
V2I2V Vehicle-to-Infrastructure-to-Vehicle. 11
V2V Vehicle-to-Vehicle. 11
VISP VIenna ecosystem for elastic Stream Processing. 28, 29, 34
VM Virtual Machine. 14
W3C World Wide Web Consortium. 9
YARN Yet Another Resource Negotiator. 20
YSB Yahoo Streaming Benchmarks. 45

Bibliography

[1] Min Chen, Shiwen Mao, and Yunhao Liu. Big Data: A Survey. Mobile Networks and Applications, 2014.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008.
[3] Aditya B Patel, Manashvi Birla, and Ushma Nair. Addressing Big Data Problem Using Hadoop and Map Reduce. In Nirma University International Conference on Engineering. IEEE, 2012.
[4] Hamid Nasiri, Saeed Nasehi, and Maziar Goudarzi. Evaluation of Distributed Stream Processing Frameworks for IoT Applications in Smart Cities. Journal of Big Data, 2019.
[5] Haruna Isah, Tariq Abughofa, Sazia Mahfuz, Dharmitha Ajerla, Farhana Zulkernine, and Shahzad Khan. A Survey of Distributed Data Stream Processing Frameworks. IEEE Access, 2019.
[6] Apache Storm Documentation. https://storm.apache.org/releases/2.2.0/Tutorial.html. Last accessed: 2021-01-29.
[7] Apache Flink Documentation. https://ci.apache.org/projects/flink/flink-docs-release-1.12. Last accessed: 2021-01-29.
[8] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. Twitter Heron: Stream Processing at Scale. In ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, 2015.
[9] Apache Spark Documentation. https://spark.apache.org. Last accessed: 2021-01-29.
[10] Valeria Cardellini, Vincenzo Grassi, Francesco Lo Presti, and Matteo Nardelli. On QoS-Aware Scheduling of Data Stream Applications over Fog Computing Infrastructures. In IEEE Symposium on Computers and Communications, 2016.
[11] Tobias Pfandzelter and David Bermbach. IoT Data Processing in the Fog: Functions, Streams, or Batch Processing? In IEEE International Conference on Fog Computing, 2019.
[12] Farahd Mehdipour, Bahman Javadi, and Aniket Mahanti. FOG-Engine: Towards Big Data Analytics in the Fog. In IEEE 14th International Conference on Dependable, Autonomic and Secure Computing, 2016.
[13] Thomas Hiessl, Vasileios Karagiannis, Christoph Hochreiner, Stefan Schulte, and Matteo Nardelli. Optimal Placement of Stream Processing Operators in the Fog. In IEEE 3rd International Conference on Fog and Edge Computing, 2019.
[14] Hooman Peiro Sajjad, Ken Danniswara, Ahmad Al-Shishtawy, and Vladimir Vlassov. SpanEdge: Towards Unifying Stream Processing over Central and Near-the-edge Data Centers. In 1st IEEE/ACM Symposium on Edge Computing, 2016.
[15] Nathan Marz and James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co., 2015.
[16] Mariam Kiran, Peter Murphy, Inder Monga, Jon Dugan, and Sartaj Singh Baveja. Lambda Architecture for Cost-effective Batch and Speed Big Data Processing. In IEEE International Conference on Big Data, 2015.
[17] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli. Fog Computing and its Role in the Internet of Things. In 1st ACM Mobile Cloud Computing Workshop, 2012.
[18] Roel Wieringa. Design Science Methodology: Principles and Practice. In 32nd ACM/IEEE International Conference on Software Engineering, 2010.
[19] Roel Wieringa. Design Science Methodology: For Information Systems and Software Engineering. Springer, 2014.
[20] Alan R Hevner, Salvatore T March, Jinsoo Park, and Sudha Ram. Design Science in Information Systems Research. MIS Quarterly: Management Information Systems, 2004.
[21] Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. Benchmarking Distributed Stream Data Processing Systems. In IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018.
[22] Jonathan Hasenburg, Martin Grambow, Elias Grunewald, Sascha Huk, and David Bermbach. MockFog: Emulating Fog Computing Infrastructure in the Cloud. In 1st IEEE International Conference on Fog Computing. IEEE, 2019.
[23] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.

[24] Ammar Rayes and Samer Salam. Internet of Things from Hype to Reality. Springer, 2019.
[25] International Telecommunication Union. ITU Internet Reports. The Internet of Things. 2005.
[26] Farshad Firouzi, Krishnendu Chakrabarty, and Sani Nassif. Intelligent Internet of Things: From Device to Fog and Cloud. Springer, 2020.
[27] International Telecommunication Union. Overview of the Internet of Things. 2012.
[28] Roy Want, Bill N Schilit, and Scott Jenson. Enabling the Internet of Things. Computer, 2015.
[29] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The Internet of Things: A Survey. Computer Networks, 2010.
[30] Ala Al-Fuqaha, Mohsen Guizani, Mehdi Mohammadi, Mohammed Aledhari, and Moussa Ayyash. Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications. IEEE Communications Surveys and Tutorials, 2015.
[31] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions. Future Generation Computer Systems, 2013.
[32] Shadi Al-Sarawi, Mohammed Anbar, Kamal Alieyan, and Mahmood Alzubaidi. Internet of Things (IoT) Communication Protocols. In ICIT 2017 - 8th International Conference on Information Technology. IEEE, 2017.
[33] Moeen Hassanalieragh, Alex Page, Tolga Soyata, Gaurav Sharma, Mehmet Aktas, Gonzalo Mateos, Burak Kantarci, and Silvana Andreescu. Health Monitoring and Management Using Internet-of-Things (IoT) Sensing with Cloud-Based Processing: Opportunities and Challenges. In 2015 IEEE International Conference on Services Computing. IEEE, 2015.
[34] Tara Salman and Raj Jain. Networking Protocols and Standards for Internet of Things. Internet of Things and Data Analytics Handbook, 2015.
[35] Matthew Gigli, Simon Koo, et al. Internet of Things: Services and Applications Categorization. Advances in Internet of Things, 2011.
[36] Pai Zheng, Zhiqian Sang, Ray Y Zhong, Yongkui Liu, Chao Liu, Khamdi Mubarok, Shiqiang Yu, Xun Xu, et al. Smart Manufacturing Systems for Industry 4.0: Conceptual Framework, Scenarios, and Future Perspectives. Frontiers of Mechanical Engineering, 2018.
[37] Dongpu Cao, Li Li, Clara Marina, Long Chen, Yang Xing, and Weihua Zhuang. Special Issue on Internet of Things for Connected Automated Driving. IEEE Internet of Things Journal, 2020.

[38] Francesco Bellotti, Riccardo Berta, Ahmad Kobeissi, Nisrine Osman, Eduardo Arnold, Mehrdad Dianati, Ben Nagy, and Alessandro De Gloria. Designing an IoT Framework for Automated Driving Impact Analysis. In IEEE Intelligent Vehicles Symposium. IEEE, 2019.
[39] Aditya Gaur, Bryan Scotney, Gerard Parr, and Sally McClean. Smart City Architecture and its Applications based on IoT. Procedia Computer Science, 2015.
[40] Yin Jie, Ji Yong Pei, Li Jun, Guo Yun, and Xu Wei. Smart Home System based on IoT Technologies. In International Conference on Computational and Information Sciences. IEEE, 2013.
[41] KA Patil and NR Kale. A Model for Smart Agriculture Using IoT. In International Conference on Global Trends in Signal Processing, Information Computing and Communication. IEEE, 2016.
[42] Miao Yun and Bu Yuxin. Research on the Architecture and Key Technology of Internet of Things (IoT) Applied on Smart Grid. In International Conference on Advances in Energy Engineering. IEEE, 2010.
[43] Daniel Minoli, Kazem Sohraby, and Benedict Occhiogrosso. IoT Considerations, Requirements, and Architectures for Smart Buildings - Energy Optimization and Next-Generation Building Management Systems. IEEE Internet of Things Journal, 2017.
[44] Sara Amendola, Rossella Lodato, Sabina Manzari, Cecilia Occhiuzzi, and Gaetano Marrocco. RFID Technology for IoT-based Personal Healthcare in Smart Spaces. IEEE Internet of Things Journal, 2014.
[45] Bahar Farahani, Farshad Firouzi, Victor Chang, Mustafa Badaroglu, Nicholas Constant, and Kunal Mankodiya. Towards Fog-driven IoT eHealth: Promises and Challenges of IoT in Medicine and Healthcare. Future Generation Computer Systems, 2018.
[46] Peter Mell and Tim Grance. The NIST Definition of Cloud Computing. 2011.
[47] Amir Vahid Dastjerdi and Rajkumar Buyya. Fog Computing: Helping the Internet of Things Realize its Potential. Computer, 2016.
[48] Luis M Vaquero and Luis Rodero-Merino. Finding your Way in the Fog: Towards a Comprehensive Definition of Fog Computing. ACM SIGCOMM Computer Communication Review, 2014.
[49] Mohammad Aazam, Sherali Zeadally, and Khaled A Harras. Offloading in Fog Computing for IoT: Review, Enabling Technologies, and Research Opportunities. Future Generation Computer Systems, 2018.

[50] Rajkumar Buyya and Amir Vahid Dastjerdi. Internet of Things: Principles and Paradigms. Elsevier, 2016.
[51] Shancang Li, Li Da Xu, and Shanshan Zhao. 5G Internet of Things: A Survey. Journal of Industrial Information Integration, 2018.
[52] Philipp Schulz, Maximilian Matthe, Henrik Klessig, Meryem Simsek, Gerhard Fettweis, Junaid Ansari, Shehzad Ali Ashraf, Bjoern Almeroth, Jens Voigt, Ines Riedel, et al. Latency Critical IoT Applications in 5G: Perspective on the Design of Radio Interface and Network Architecture. IEEE Communications Magazine, 2017.
[53] Nicolas Sornin, Miguel Luis, Thomas Eirich, Thorsten Kramp, and Olivier Hersent. LoRaWAN Specification. LoRa Alliance, 2015.
[54] Ferran Adelantado, Xavier Vilajosana, Pere Tuset-Peiro, Borja Martinez, Joan Melia-Segui, and Thomas Watteyne. Understanding the Limits of LoRaWAN. IEEE Communications Magazine, 2017.
[55] Kashif Bilal, Osman Khalid, Aiman Erbad, and Samee U Khan. Potentials, Trends, and Prospects in Edge Technologies: Fog, Cloudlet, Mobile Edge, and Micro Data Centers. Computer Networks, 2018.
[56] Koustabh Dolui and Soumya Kanti Datta. Comparison of Edge Computing Implementations: Fog Computing, Cloudlet and Mobile Edge Computing. In Global Internet of Things Summit. IEEE, 2017.
[57] Karim Habak, Mostafa Ammar, Khaled A Harras, and Ellen Zegura. FemtoClouds: Leveraging Mobile Devices to Provide Cloud Service at the Edge. In IEEE 8th International Conference on Cloud Computing. IEEE, 2015.
[58] Mohammad Aazam, Sherali Zeadally, and Khaled A Harras. Deploying Fog Computing in Industrial Internet of Things and Industry 4.0. IEEE Transactions on Industrial Informatics, 2018.
[59] Bo Tang, Zhen Chen, Gerald Hefferman, Shuyi Pei, Tao Wei, Haibo He, and Qing Yang. Incorporating Intelligence in Fog Computing for Big Data Analysis in Smart Cities. IEEE Transactions on Industrial Informatics, 2017.
[60] Shanhe Yi, Cheng Li, and Qun Li. A Survey of Fog Computing: Concepts, Applications and Issues. In 2015 Workshop on Mobile Big Data. Association for Computing Machinery, 2015.
[61] Philip Russom et al. Big Data Analytics. TDWI Best Practices Report, 2011.
[62] Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li. Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2014.

[63] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache Flink: Stream and Batch Processing in a Single Engine. IEEE Computer Society Technical Committee on Data Engineering, 2015.
[64] Akaash Vishal Hazarika, G. Jagadeesh Sai Raghu Ram, and Eeti Jain. Performance Comparision of Hadoop and Spark Engine. In Proc. Int. Conf. IoT Soc. Mobile, Anal. Cloud, I-SMAC 2017, 2017.
[65] Supun Kamburugamuve and Geoffrey Fox. Survey of Distributed Stream Processing. Bloomington: Indiana University, 2016.
[66] Ehab Qadah, Michael Mock, Elias Alevizos, and Georg Fuchs. Lambda Architecture for Batch and Stream Processing. CEUR Workshop, 2018.
[67] Vivek Kale. Parallel Computing Architectures and APIs: IoT Big Data Stream Processing. CRC Press, 2019.
[68] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J Franklin, Joseph M Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R Madden, Fred Reiss, and Mehul A Shah. TelegraphCQ: Continuous Dataflow Processing. In ACM SIGMOD International Conference on Management of Data, 2003.
[69] Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Itaru Nishizawa, Justin Rosenstein, and Jennifer Widom. STREAM: The Stanford Stream Data Manager. In ACM SIGMOD International Conference on Management of Data, 2003.
[70] Daniel Abadi, Donald Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, C Erwin, Eduardo Galvez, M Hatoun, Anurag Maskey, Alex Rasin, et al. Aurora: A Data Stream Management System. In ACM SIGMOD International Conference on Management of Data, 2003.
[71] Daniel J Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, et al. The Design of the Borealis Stream Processing Engine. In Biennial Conference on Innovative Data Systems Research, 2005.
[72] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. In International Conference on Compiler Construction. Springer, 2002.
[73] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S Yu, and Myungcheol Doo. SPADE: The System S Declarative Stream Processing Engine. In ACM SIGMOD International Conference on Management of Data, 2008.
[74] Alain Biem, Eric Bouillet, Hanhua Feng, Anand Ranganathan, Anton Riabov, Olivier Verscheure, Haris Koutsopoulos, and Carlos Moran. IBM InfoSphere Streams for Scalable, Real-time, Intelligent Transportation Services. In ACM SIGMOD International Conference on Management of Data, 2010.

[75] M Jankowski, P Pathirana, and ST Allen. Storm Applied: Strategies for Real-time Event Processing, 2015.
[76] Marcos Dias de Assunção, Alexandre da Silva Veith, and Rajkumar Buyya. Distributed Data Stream Processing and Edge Computing: A Survey on Resource Elasticity and Future Directions. Journal of Network and Computer Applications, 2018.
[77] Holden Karau and Rachel Warren. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. O'Reilly Media, 2017.
[78] Wei Huang, Lingkui Meng, Dongying Zhang, and Wen Zhang. In-memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop Yarn Model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2016.
[79] Arun Kejariwal, Sanjeev Kulkarni, and Karthik Ramasamy. Real Time Analytics: Algorithms and Systems. Proceedings of the VLDB Endowment, 2015.
[80] Ashkan Yousefpour, Caleb Fung, Tam Nguyen, Krishna Kadiyala, Fatemeh Jalali, Amirreza Niakanlahiji, Jian Kong, and Jason P Jue. All One Needs to Know about Fog Computing and Related Edge Computing Paradigms: A Complete Survey. Journal of Systems Architecture, 2019.
[81] Fatima Hussain and Ameera Al-Karkhi. Internet of Things: Building Blocks and Business Models, chapter Big Data and Fog Computing. Springer, 2017.
[82] Marcelo Yannuzzi, Rodolfo Milito, René Serral-Gracia, Diego Montero, and Mario Nemirovsky. Key Ingredients in an IoT Recipe: Fog Computing, Cloud Computing, and more Fog Computing. In IEEE 19th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks, 2014.
[83] Subhadeep Sarkar, Subarna Chatterjee, and Sudip Misra. Assessment of the Suitability of Fog Computing in the Context of Internet of Things. IEEE Transactions on Cloud Computing, 2015.
[84] P. Silva, A. Costan, and G. Antoniu. Towards a Methodology for Benchmarking Edge Processing Frameworks. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019.
[85] Apache Edgent Overview. http://edgent.incubator.apache.org/docs/home.html. Last accessed: 2021-01-29.

[86] Gerrit Janßen, Ilya Verbitskiy, Thomas Renner, and Lauritz Thamsen. Scheduling Stream Processing Tasks on Geo-distributed Heterogeneous Resources. In IEEE International Conference on Big Data. IEEE, 2018.
[87] Christoph Hochreiner, Michael Vogler, Philipp Waibel, and Schahram Dustdar. VISP: An Ecosystem for Elastic Data Stream Processing for the Internet of Things. In IEEE 20th International Enterprise Distributed Object Computing Conference (EDOC). IEEE, 2016.
[88] Thomas Hießl, Christoph Hochreiner, and Stefan Schulte. Towards a Framework for Data Stream Processing in the Fog. Informatik Spektrum, 2019.
[89] Hooman Peiro Sajjad, Ken Danniswara, Ahmad Al-Shishtawy, and Vladimir Vlassov. SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers. In IEEE/ACM Symposium on Edge Computing (SEC), 2016.
[90] Bin Cheng, Apostolos Papageorgiou, Flavio Cirillo, and Ernoe Kovacs. GeeLytics: Geo-distributed Edge Analytics for Large Scale IoT Systems Based on Dynamic Topology. In IEEE 2nd World Forum on Internet of Things (WF-IoT), 2015.
[91] Dan O'Keeffe, Theodoros Salonidis, and Peter Pietzuch. Frontier: Resilient Edge Processing for the Internet of Things. Proceedings of the VLDB Endowment, 2018.
[92] Steffen Zeuch, Ankit Chaudhary, Bonaventura Del Monte, Haralampos Gavriilidis, Dimitrios Giouroukis, Philipp M. Grulich, Sebastian Bress, Jonas Traub, and Volker Markl. The NebulaStream Platform: Data and Application Management for the Internet of Things. ArXiv, 2019.
[93] Davaadorj Battulga, Daniele Miorandi, and Cédric Tedeschi. FogGuru: A Fog Computing Platform Based on Apache Flink. In 23rd Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN). IEEE, 2020.
[94] Derya Ucuz et al. Comparison of the IoT Platform Vendors, Microsoft Azure, Amazon Web Services, and Google Cloud, from Users' Perspectives. In 8th International Symposium on Digital Forensics and Security (ISDFS). IEEE, 2020.
[95] Paola Pierleoni, Roberto Concetti, Alberto Belli, and Lorenzo Palma. Amazon, Google and Microsoft Solutions for IoT: Architectures and a Performance Comparison. IEEE Access, 2019.
[96] Amazon AWS Greengrass. https://aws.amazon.com/greengrass/. Last accessed: 2021-01-29.
[97] Microsoft Azure IoT. https://azure.microsoft.com/en-us/services/iot-edge/. Last accessed: 2021-01-29.
[98] Google Cloud IoT. https://cloud.google.com/iot-core/. Last accessed: 2021-01-29.

[99] Anirban Das, Stacy Patterson, and Mike Wittie. EdgeBench: Benchmarking Edge Computing Platforms. In IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion). IEEE, 2018.
[100] Cristian Martín Fernández, Manuel Díaz Rodríguez, and Bartolomé Rubio Muñoz. An Edge Computing Architecture in the Internet of Things. In IEEE 21st International Symposium on Real-Time Distributed Computing (ISORC). IEEE, 2018.
[101] Manuel Díaz, Cristian Martín, and Bartolomé Rubio. λ-CoAP: An Internet of Things and Cloud Computing Integration Based on the Lambda Architecture and CoAP. In International Conference on Collaborative Computing: Networking, Applications and Worksharing. Springer, 2015.
[102] Yoji Yamato, Hiroki Kumazaki, and Yoshifumi Fukumoto. Proposal of Lambda Architecture Adoption for Real Time Predictive Maintenance. In 4th International Symposium on Computing and Networking. IEEE, 2016.
[103] Tasneem S J Darwish and Kamalrulnizam Abu Bakar. Fog Based Intelligent Transportation Big Data Analytics in The Internet of Vehicles Environment: Motivations, Architecture, Challenges, and Critical Issues. IEEE Access, 2018.
[104] Wei Wang, Lei Fan, Pu Huang, and Hai Li. A New Data Processing Architecture for Multi-Scenario Applications in Aviation Manufacturing. IEEE Access, 2019.
[105] Jay Kreps. Questioning the Lambda Architecture. O'Reilly, 2014.
[106] Jimmy Lin. The Lambda and the Kappa. IEEE Internet Computing, 2017.
[107] Apisit Sanla and Thanisa Numnonda. A Comparative Performance of Real-time Big Data Analytic Architectures. In IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC). IEEE, 2019.
[108] Maxim Chernyshev, Zubair Baig, Oladayo Bello, and Sherali Zeadally. Internet of Things (IoT): Research, Simulators, and Testbeds. IEEE Internet of Things Journal, 2017.
[109] Harshit Gupta, Amir Vahid Dastjerdi, Soumya K Ghosh, and Rajkumar Buyya. iFogSim: A Toolkit for Modeling and Simulation of Resource Management Techniques in the Internet of Things, Edge and Fog Computing Environments. Software: Practice and Experience, 2017.
[110] Xuezhi Zeng, Saurabh Kumar Garg, Peter Strazdins, Prem Prakash Jayaraman, Dimitrios Georgakopoulos, and Rajiv Ranjan. IOTSim: A Simulator for Analysing IoT Applications. Journal of Systems Architecture, 2017.
[111] Cagatay Sonmez, Atay Ozgovde, and Cem Ersoy. EdgeCloudSim: An Environment for Performance Evaluation of Edge Computing Systems. Transactions on Emerging Telecommunications Technologies, 2018.

[112] Fredrik Osterlind, Adam Dunkels, Joakim Eriksson, Niclas Finne, and Thiemo Voigt. Cross-level Sensor Network Simulation with Cooja. In 31st IEEE Conference on Local Computer Networks. IEEE, 2006.
[113] Christian Kunde and Zoltán Ádám Mann. Comparison of Simulators for Fog Computing. In 35th Annual ACM Symposium on Applied Computing, 2020.
[114] Ruben Mayer, Leon Graser, Harshit Gupta, Enrique Saurez, and Umakishore Ramachandran. EmuFog: Extensible and Scalable Emulation of Large-scale Fog Computing Infrastructures. In IEEE Fog World Congress (FWC). IEEE, 2017.
[115] Antonio Coutinho, Fabiola Greve, Cassio Prazeres, and Joao Cardoso. FogBed: A Rapid-prototyping Emulation Environment for Fog Computing. In IEEE International Conference on Communications (ICC). IEEE, 2018.
[116] Bukhary Ikhwan Ismail, Ehsan Mostajeran Goortani, Mohd Bazli Ab Karim, Wong Ming Tat, Sharipah Setapa, Jing Yuan Luke, and Ong Hong Hoe. Evaluation of Docker as Edge Computing Platform. In IEEE Conference on Open Systems (ICOS). IEEE, 2015.
[117] Cedric Adjih, Emmanuel Baccelli, Eric Fleury, Gaetan Harter, Nathalie Mitton, Thomas Noel, Roger Pissard-Gibollet, Frederic Saint-Marcel, Guillaume Schreiner, Julien Vandaele, et al. FIT IoT-LAB: A Large Scale Open Experimental IoT Testbed. In IEEE 2nd World Forum on Internet of Things (WF-IoT). IEEE, 2015.
[118] Luis Sanchez, Luis Muñoz, Jose Antonio Galache, Pablo Sotres, Juan R Santana, Veronica Gutierrez, Rajiv Ramdhany, Alex Gluhak, Srdjan Krco, Evangelos Theodoridis, et al. SmartSantander: IoT Experimentation over a Smart City Testbed. Computer Networks, 2014.
[119] Junyan Ma, Jin Wang, and Te Zhang. A Survey of Recent Achievements for Wireless Sensor Networks Testbeds. In International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 2017.
[120] Ziya Karakaya, Ali Yazici, and Mohammed Alayyoub. A Comparison of Stream Processing Frameworks. In International Conference on Computer and Applications (ICCA). IEEE, 2017.
[121] Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Jerry Peng, et al. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming. In IEEE 30th International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016.
[122] Henrique C. M. Andrade, Buğra Gedik, and Deepak S. Turaga. Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Cambridge University Press, 2014.

[123] Fabian Hueske and Vasiliki Kalavri. Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications. O'Reilly Media, 2019.
[124] Apache Kafka Documentation. https://kafka.apache.org. Last accessed: 2021-01-29.
[125] Shusen Yang. IoT Stream Processing and Analytics in the Fog. IEEE Communications Magazine, 2017.
[126] Goiuri Peralta, Markel Iglesias-Urkia, Marc Barcelo, Raul Gomez, Adrian Moran, and Josu Bilbao. Fog Computing Based Efficient IoT Scheme for the Industry 4.0. In IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM). IEEE, 2017.
[127] Philippe Dobbelaere and Kyumars Sheykh Esmaili. Kafka versus RabbitMQ: A Comparative Study of Two Industry Reference Publish/Subscribe Implementations. In 11th ACM International Conference on Distributed Event-Based Systems, 2017.
[128] Elarbi Badidi. Towards a Message Broker Based Platform for Real-time Streaming of Urban IoT Data. In Computational Methods in Systems and Software. Springer, 2018.
[129] Aditya Eka Bagaskara, Setyorini Setyorini, and Aulia Arif Wardana. Performance Analysis of Message Broker for Communication in Fog Computing. In 12th International Conference on Information Technology and Electrical Engineering (ICITEE). IEEE, 2020.
[130] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: A Warehousing Solution over a Map-Reduce Framework. Proceedings of the VLDB Endowment, 2009.
[131] Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Jerry Peng, et al. Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016.
[132] Apache HBase Documentation. https://hbase.apache.org. Last accessed: 2021-01-29.
[133] Apache Cassandra Documentation. https://cassandra.apache.org. Last accessed: 2021-01-29.
[134] The DataStax Blog: Multi-Datacenter Cassandra on 32 Raspberry Pi's. https://www.datastax.com/blog/2014/08/multi-datacenter-cassandra-32-raspberry-pis. Last accessed: 2021-01-29.

[135] Ruben Mayer, Harshit Gupta, Enrique Saurez, and Umakishore Ramachandran. FogStore: Toward a Distributed Data Store for Fog Computing. In IEEE Fog World Congress (FWC). IEEE, 2017.
[136] Alex Kaplunovich and Yelena Yesha. Consolidating Billions of Taxi Rides with AWS EMR and Spark in the Cloud: Tuning, Analytics and Best Practices. In IEEE International Conference on Big Data. IEEE, 2018.
[137] Docker Documentation. https://docs.docker.com/get-started/. Last accessed: 2021-01-29.
[138] Saiful Hoque, Mathias Santos de Brito, Alexander Willner, Oliver Keil, and Thomas Magedanz. Towards Container Orchestration in Fog Computing Infrastructures. In IEEE 41st Annual Computer Software and Applications Conference. IEEE, 2017.
[139] Thomas Vanhove, Gregory Van Seghbroeck, Tim Wauters, Bruno Volckaert, and Filip De Turck. Managing the Synchronization in the Lambda Architecture for Optimized Big Data Analysis. IEICE Transactions on Communications, 2016.
[140] Apache Oozie Documentation. https://oozie.apache.org. Last accessed: 2021-01-29.
[141] Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm. In 2014 USENIX Annual Technical Conference, 2014.
[142] Luiz F Bittencourt, Javier Diaz-Montes, Rajkumar Buyya, Omer F Rana, and Manish Parashar. Mobility-aware Application Scheduling in Fog Computing. IEEE Cloud Computing, 2017.
[143] Ayati Miatra and Sumit Kumar. Security Issues With Fog Computing. In 10th International Conference on Cloud Computing, Data Science Engineering (Confluence), 2020.
[144] PeiYun Zhang, MengChu Zhou, and Giancarlo Fortino. Security and Trust Issues in Fog Computing: A Survey. Future Generation Computer Systems, 2018.
[145] Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. SustKDD, 2012.
[146] Redowan Mahmud, Kotagiri Ramamohanarao, and Rajkumar Buyya. Latency-aware Application Module Management for Fog Computing Environments. ACM Transactions on Internet Technology (TOIT), 2018.
[147] Shreyas Badiger, Shrey Baheti, and Yogesh Simmhan. Violet: A Large-scale Virtual Environment for Internet of Things. In European Conference on Parallel Processing. Springer, 2018.
