UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA Y SISTEMAS DE TELECOMUNICACIÓN

EIT DIGITAL MASTER IN INTERNET TECHNOLOGY AND ARCHITECTURE

Secure LwM2M IoT streaming data pipelines in Hopsworks

Master Thesis

Kajetan Maliszewski

Madrid, July 2019

Contents

1 Introduction
 1.1 Problem description
 1.2 Purpose
 1.3 Goals
 1.4 Outline

2 Background
 2.1 IoT Architecture
 2.2 IoT Nodes
 2.3 IoT Gateway
 2.4 Hopsworks Ecosystem
 2.5 Apache Kafka
 2.6 Stream Processing
 2.7 Security

3 Architecture
 3.1 Components
 3.2 IoT Gateway in the Hopsworks Ecosystem
 3.3 IoT Gateway Architecture
  3.3.1 LeshanService
  3.3.2 DatabaseService
  3.3.3 ProducerService
  3.3.4 HopsworksService
 3.4 Hopsworks Architecture
  3.4.1 Hopsworks Database
  3.4.2 User Interface
  3.4.3 IotGatewayResource
  3.4.4 Data Storage
  3.4.5 Streaming Jobs
 3.5 IoT Nodes
  3.5.1 Endpoint Client Name
  3.5.2 Measurements Timestamping
  3.5.3 Measurement Life Cycle
 3.6 Security

4 Implementation
 4.1 IoT Nodes
 4.2 IoT Gateway
  4.2.1 LeshanService
  4.2.2 DatabaseService
  4.2.3 ProducerService
  4.2.4 HopsworksService
 4.3 Hopsworks
  4.3.1 Hopsworks Database
  4.3.2 IoTGatewayResource
  4.3.3 User Interface
  4.3.4 Streaming Jobs
 4.4 Installation

5 Evaluation
 5.1 Verification
 5.2 Validation
  5.2.1 Test setup
  5.2.2 Test with an IoT simulator
  5.2.3 Test with a real IoT device
  5.2.4 Multiple gateways test
  5.2.5 Failure test
  5.2.6 Anomaly Detection Test
 5.3 Benchmarking
  5.3.1 Latency in a local setup
  5.3.2 Latency in a remote setup
  5.3.3 Latency results analysis
  5.3.4 Cold and warm startup

6 Conclusion
 6.1 Goals Achieved
 6.2 Future Work
 6.3 Reflections

Bibliography

List of Figures

2.1 Example of an IoT architecture [7].
2.2 Hopsworks ecosystem schema [18].
3.1 Project Architecture.
3.2 IoT registration procedure.
3.3 Example of tables generated for DatabaseService.
3.4 IoT Gateway state in Hopsworks.
3.5 New gateways table in Hopsworks database.
3.6 Measurement life cycle.
3.7 System Security Architecture.
4.1 Zolertia Firefly (top) and Thunderboard Sense 2 (bottom).
4.2 Sequence diagram of a REST call getting the list of IoT Nodes.
4.3 UI - Enter IoT Gateway Details window.
4.4 UI - Overview of IoT tab.
4.5 UI - IoT Gateway Details window.
4.6 UI - IoT Nodes window.
5.1 Screenshots of running IoT Gateway (top) and IoT Node simulator (bottom).
5.2 Screenshot of running Eclipse Leshan server.
5.3 IoT simulator data retrieved from HopsFS.
5.4 Kafka ACL after detection of too high traffic on a gateway.
5.5 Measurement delivery time for local setup.
5.6 Measurement delivery time for remote setup.
5.7 Average latency benchmark result comparison.
5.8 Measurement latency with cold and warm startup.

List of Tables

3.1 HopsworksService REST API.
3.2 IotGatewayResource REST API.
5.1 bbc2 test machine specifications.
5.2 computer test machine specifications.
5.3 Software branches used for tests.
5.4 Average latency benchmark results.

List of Acronyms

6LoWPAN IPv6 over Low-Power Wireless Personal Area Networks
ACK Acknowledge
ACL Access Control List
API Application Programming Interface
CoAP Constrained Application Protocol
CoAPS DTLS-Secured Constrained Application Protocol
DTLS Datagram Transport Layer Security
DDoS Distributed Denial of Service
EUI Extended Unique Identifier
FS File System
GPU Graphics Processing Unit
Hops Hadoop Open Platform-as-a-Service
HDFS Hadoop Distributed File System
HTTP Hypertext Transfer Protocol
HTTPS Hypertext Transfer Protocol Secure
IMEI International Mobile Equipment Identity
IP Internet Protocol
IPSO Internet Protocol for Smart Objects
IoT Internet of Things
JDBC Java Database Connectivity
JSON JavaScript Object Notation

JVM Java Virtual Machine
JWT JSON Web Token
MAC Media Access Control
ML Machine Learning
MVC Model-View-Controller
MVVM Model-View-ViewModel
NAT Network Address Translation
OMA LwM2M Open Mobile Alliance Lightweight Machine-to-Machine
PKI Public Key Infrastructure
PSK Pre-Shared Key
REST Representational State Transfer
RPK Raw Public Key
SQL Structured Query Language
TLS Transport Layer Security
TSDB Time-Series Database
UI User Interface
URN Uniform Resource Name
UUID Universally Unique Identifier
VM Virtual Machine

Summary

The number of internet-connected devices has already far surpassed the number of human beings, and the pace of growth is so high that the number will double within the next five years. The ecosystem of these devices, collectively called the Internet of Things (IoT), is a source of a tremendous amount of data and creates several unprecedented challenges for researchers and companies. New, unconventional ways of storing, analyzing, and processing the data had to be proposed. One such solution is Hadoop Open Platform-as-a-Service (Hops), the result of years of joint research between KTH Royal Institute of Technology in Stockholm and RISE SICS AB. It is a platform enabling the analysis of extremely large volumes of data with cutting-edge, open-source technologies for Big Data and Machine Learning (ML). This master's thesis provides support for connecting these two environments. It provides instruments for the secure and reliable ingestion of IoT data into the Hops platform. Moreover, it provides tools for maintaining security by supporting the execution of mitigating measures, such as the automated exclusion of misbehaving devices and the dropping of traffic from sources of Distributed Denial of Service (DDoS) attacks. To allow the data ingestion, a new element was introduced into the ecosystem: the IoT Gateway, a platform to which authenticated IoT devices can stream data. Furthermore, Hopsworks, one of Hops' main components, was extended with a REST API that allows the gateways to securely connect to the Hops ecosystem. A testbed, including an IoT software simulator and a real IoT device with dedicated hardware, was built, and the system was comprehensively tested and benchmarked. The architecture is based on open and widely used security protocols: Raw Public Key (RPK) and Hypertext Transfer Protocol Secure (HTTPS). It is shown that the proposed solution is performant, scalable, and highly reliable in a real-life scenario.
To the best of our knowledge, the work done in this thesis makes Hopsworks the world's first open-source Big Data platform with secure IoT data ingestion.

Resumen

The number of devices connected to the Internet has already surpassed the number of human beings, and the pace of growth is so high that it will double within the next five years. The ecosystem of these devices, collectively called the Internet of Things (IoT), is a source of vast amounts of data and creates several unprecedented challenges for researchers and companies. New, unconventional ways of operating on the data have been proposed. One such solution is Hadoop Open Platform-as-a-Service (Hops), the result of research between KTH Royal Institute of Technology in Stockholm and RISE SICS AB. It is a platform that enables the analysis of extremely large amounts of data with innovative, open-source Big Data and Machine Learning (ML) technologies. This master's thesis provides support for connecting these two technologies. It also provides instruments for the secure and trusted ingestion of IoT data into the Hops platform, as well as tools for maintaining security by enabling mitigating measures such as the automated exclusion of sources of Distributed Denial of Service (DDoS) attacks. To enable the data ingestion, a new element was introduced: the IoT Gateway, a platform to which authenticated IoT devices can stream their data. Hopsworks, a component of Hops, was extended with a REST API that allows the gateways to connect to the Hops platform. A testbed was developed to evaluate the platform, together with an IoT simulator and a real IoT device with dedicated hardware. The architecture is based on open and widely used security protocols: Raw Public Key (RPK) and Hypertext Transfer Protocol Secure (HTTPS).
Finally, it is concluded that the proposed solution is performant, scalable, and highly reliable in a real-life scenario. To the best of our knowledge, Hopsworks is the world's first open-source Big Data platform with secure IoT data ingestion.

Acknowledgments

I would like to express my deepest gratitude to my supervisor Theofilos Kakantousis for his continuous support during the internship. His experience in the fields relevant to the project, a head full of ideas and accurate questions, and willingness to advise were extremely helpful and made the internship an invaluable experience. I would also like to thank my university supervisor, Professor Mariano Ruiz, for helping me with the project, making the thesis procedure smooth, and agreeing to do all of it fully remotely.

Chapter 1

Introduction

These days, people are connecting more and more internet-enabled devices. According to predictions, there will be around 22 billion connected devices in 2019 [1]. As stated in the same report, that number will keep growing at a similar pace in the following years. We are entering a new era, in which the ordinary objects we are used to having will start communicating with each other through the Internet. We are entering the era of IoT. Efficiently processing all the data that these devices are capable of producing is not a straightforward task. New ways of collecting information had to be developed: systems that can operate on petabytes of data. These systems had to be designed with scalability and fault tolerance in mind, since the amount of data to process is only growing. The Apache Hadoop Distributed File System (HDFS) project [2] became the de facto standard of the industry. HopsFS [3] is an improved distribution of Hadoop. By storing the metadata in a NewSQL database, the creators of HopsFS mitigated the main bottleneck of Hadoop, making it sixteen times faster in terms of throughput and latency [4]. Hopsworks is the front-end for HopsFS. It integrates HopsFS with many popular Data Science tools, making Hadoop simple and accessible. Many IoT systems, such as intelligent cars, smart grids, or healthcare systems, require reactions from the backend systems in near real time. This is why the IoT industry developed many solutions based on stream processing, a technique in which data is processed as it arrives in the system, without being stored first. This enables the system to operate on much smaller volumes of data and to react in near real time. One of the most popular streaming platforms is Apache Kafka [5]. Many of the largest tech companies, such as Google or Amazon, have recently entered the IoT market offering specialized cloud solutions. It shows that there has been lots of

recent interest in integrating IoT and Big Data but, to the best of our knowledge, there is currently no open-source platform supporting the secure, end-to-end ingestion of IoT data into Big Data platforms. The work conducted in this thesis aims at making Hopsworks the first secure, open-source platform capable of receiving IoT data at massive scale. The project also provides generic support for the automated classification of IoT data to support anomaly detection and the detection of denial-of-service attacks, providing mitigating measures such as the automated exclusion of misbehaving devices and the dropping of traffic from sources of DDoS attacks. This project was part of an industrial internship with Logical Clocks AB [6].

1.1 Problem description

The challenge of the work carried out in this thesis is streaming time-based measurements coming from IoT devices into the Hops file system through its Hopsworks component [3]. The streaming needs to be performed securely and reliably. The solutions presented in this thesis must be scalable, that is, built in a way that allows them to handle traffic from thousands or millions of IoT devices.

1.2 Purpose

Successful delivery of a solution to the problem mentioned above will enable researchers and companies to easily collect measurements from IoT networks into the Hopsworks big data platform. Furthermore, it will enable usage of stream processing frameworks, machine learning, and deep learning tools that the Hopsworks platform offers. This can potentially be of great research and business value as it enables thorough analysis and access to insights that the data has not revealed before.

1.3 Goals

The main objectives of the project are to:

• Securely ingest data from IoT devices to Hopsworks

• Provide software tools for ensuring the level of security with anomaly detection

• Provide reliable delivery for potentially unstable environments

• Enable real-time processing and visualization of IoT data in Hopsworks

1.4 Outline

Following the introduction above, this thesis discusses how to integrate IoT devices into Hopsworks. Chapter 2 presents the background knowledge necessary to understand the later parts of the thesis. Chapter 3 goes through each component's design and architecture, followed by chapter 4, which describes their implementations and provides detailed instructions on how to set up the environment, run the custom Hopsworks Virtual Machine (VM) and the IoT Gateway, and connect IoT Nodes. Chapter 5 goes through the verification and validation processes and presents the test setup and benchmarking process. Finally, chapter 6 concludes the thesis and presents future work and reflections.

Chapter 2

Background

2.1 IoT Architecture

Figure 2.1: Example of an IoT architecture [7].

A simplified IoT system architecture, most commonly found in production installations, contains three parts: IoT Nodes, IoT Gateways, and Cloud Services (figure 2.1). IoT Nodes are small IoT devices such as light sensors, temperature sensors, actuators, and others. They connect to an IoT Gateway through various wireless and wired protocols and send measurements. In a real-life installation, there can be thousands of devices connecting to a gateway.

The IoT Gateway is a "proxy" between the Cloud and the IoT Nodes. The nodes connect to it through a range of protocols and send measurements. The gateway usually takes care of authentication and authorization. It collects the measurements, aggregates them, and sends them to the cloud. Cloud Services are services running on remote servers, usually in a distributed setup capable of storing and processing extremely large amounts of data. The Cloud Services receive measurements from IoT Gateways, process them, and eventually persist them.

2.2 IoT Nodes

This section goes through technologies used for the IoT Nodes.

OMA LwM2M

The growing demand for managing lightweight, low-power devices led to the development of the Open Mobile Alliance Lightweight Machine-to-Machine (OMA LwM2M) application-layer protocol [8] to achieve the true potential of IoT. Its modern architecture incorporates some well-known open standards. It is based on Representational State Transfer (REST) and defines an extensible data model, ready for adaptation to a changing environment. It uses the Constrained Application Protocol (CoAP) as its transport, with DTLS securing the communication. It follows the client-server model, with the client running on LwM2M devices. In this project, the LwM2M protocol is used as the communication protocol between IoT Nodes and IoT Gateways. More details about the usage of LwM2M can be found in sections 3.5 and 4.1.

Eclipse Leshan

One of the most popular Java implementations of OMA LwM2M is Eclipse Leshan (Leshan) [9], a server and client implementation developed and managed by the Eclipse Foundation. At the time of writing, the project supported most of the features of the LwM2M specification. Leshan provides fairly good documentation and community support. In this thesis, the IoT Gateway uses Leshan to run a LwM2M server, and the IoT Node simulator uses it as the LwM2M client. Detailed usage of Leshan can be found in sections 3.3.1, 4.1, and 4.2.1.

IoT Hardware

Two hardware devices were used to run the experiments. The Thunderboard Sense 2 Sensor-to-Cloud Advanced IoT Development Kit [10] is a development platform aimed at easing prototyping for developers. It is filled with sensors and I/O controls, including a humidity and temperature sensor, a light sensor, a pressure sensor, an accelerometer, and LEDs. It is also equipped with radio chips for wireless communication. The second device was the Zolertia Firefly [11], a board that helps developers design and build IoT systems. It is equipped with multiple antennas, making it a popular choice for a border router. During the experiments, the Sense 2 board was used as a prototype IoT device, and the Zolertia Firefly was used as a border router enabling wireless communication between the Sense 2 and a local machine. More details about the test setup can be found in section 4.1.

Contiki-NG

Contiki-NG [12] is an operating system for next-generation IoT devices. It was designed to run on devices heavily constrained in processing power, memory, and network bandwidth. It has a lightweight, built-in TCP/IP stack and implements many modern network protocols, e.g. 6LoWPAN, RPL, CoAP, and LwM2M. It comes with many demo applications, among which is a temperature demo using the LwM2M protocol [13]. Contiki-NG is open source and a popular choice among IoT projects. During the tests, Contiki-NG was used as the operating system for the Thunderboard Sense 2 board described in the paragraph above.

2.3 IoT Gateway

This section describes the technologies and tools used to build the IoT Gateway.

Scala

Scala [14] is a multi-paradigm, general-purpose programming language developed at École Polytechnique Fédérale de Lausanne. It combines object-oriented and functional programming. Scala runs on the Java Virtual Machine (JVM), making it fully interoperable with Java, and has a strong, static type system. Scala was used as the main language for building the IoT Gateway (section 4.2) and some of the Spark Streaming jobs (section 4.3.4).

Actor Model and Akka

The Actor Model [15] is a model of concurrent computation. It defines a set of system components and rules on how they should behave and interact. The primitive unit of computation in the actor model is an actor. Actors have state and mailboxes, through which they receive messages from other actors and act accordingly. The messages are processed sequentially, in order of arrival, which eliminates data races on an actor's internal state. Actors have very limited functionality: an actor can create more actors, send messages to other actors, or decide how to behave with the next message from its mailbox (state mutation). Akka is a free, open-source toolkit for concurrent and distributed applications. It is written in the Scala programming language and is an implementation of the Actor Model. It provides Application Programming Interface (API) bindings for Scala and Java, making it a popular base for many projects. Akka HTTP is a module of Akka that implements a full server- and client-side HTTP stack on top of Akka actors. It provides developers a set of tools for providing and consuming HTTP-based services. Akka is used as the foundation of the IoT Gateway (section 4.2), and Akka HTTP is used to handle all Hypertext Transfer Protocol (HTTP) traffic in the HopsworksService (sections 3.3.4 and 4.2.4).
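The mailbox-and-sequential-processing idea above can be illustrated with a minimal, self-contained sketch (this is an illustration of the actor model in plain Java, not Akka's actual API; the class name `TinyActor` is invented for this example):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

// Minimal illustration of the actor model: each actor owns a private mailbox
// and processes messages one at a time on a single worker thread, so its
// internal state never needs locks.
final class TinyActor<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;

    TinyActor(BiConsumer<TinyActor<M>, M> behavior) {
        worker = new Thread(() -> {
            try {
                while (true) behavior.accept(this, mailbox.take());
            } catch (InterruptedException e) {
                // stop() interrupts the worker to shut the actor down
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    void tell(M message) { mailbox.add(message); } // asynchronous, non-blocking send
    void stop() { worker.interrupt(); }
}
```

A counter actor, for example, can safely accumulate a sum without synchronization, because all mutations happen on the single mailbox-processing thread.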

H2 Database

H2 [16] is an open-source Structured Query Language (SQL) database engine written in Java. It provides a very fast Java Database Connectivity (JDBC) API and can work as an in-memory database in embedded and server modes. In embedded mode it has a notably small footprint of around 2 MB. H2 is used as the underlying database in the IoT Gateway's DatabaseService. See section 3.3.2 for why H2 was selected and section 4.2.2 for implementation details.

Slick

Slick [17] is a functional-relational mapping framework for Scala. It allows writing SQL queries and managing databases in a Scala-collections style. It provides type safety, ensuring that queries are statically checked at compile time, and composability similar to that of Scala collections. Slick supports several well-known databases, including DB2, Apache Derby, and H2. Slick is used to query the H2 database in the DatabaseService (section 4.2.2).

2.4 Hopsworks Ecosystem

The Hopsworks ecosystem [3] is a set of tools and services that form a scale-out Data Science platform. Among many features, it provides management of Graphics Processing Units (GPUs) as a resource, end-to-end support for Deep Learning workflows, and support for the most popular Big Data and ML frameworks. It is available through both a User Interface (UI) and a REST API, collectively called Hopsworks. Its security model is based on Transport Layer Security (TLS) certificates. Figure 2.2 shows the schema of the Hopsworks ecosystem.

Figure 2.2: Hopsworks ecosystem schema [18].

Hopsworks introduces the concept of projects. A project can be seen as a repository containing datasets, users, and programs. Each project is completely isolated from the others, making the data secure and inaccessible from the outside.

JAX-RS

Hopsworks' REST API service was built using JAX-RS [19], a Java API for creating web services following the REST architectural pattern. It uses Java annotations as a core development concept. JAX-RS was used to implement IoTGatewayResource, a new component of Hopsworks (section 4.3.2).

AngularJS

AngularJS [20] is an open-source, front-end web framework written in JavaScript. The project is maintained by Google and its community. It addresses many of the challenges of building single-page applications and provides frameworks for the client-side architectures model-view-viewmodel (MVVM) and model-view-controller (MVC). The framework is used to run the websites of companies like Intel, NBC, and many others. AngularJS is used by Logical Clocks to run the Hopsworks front-end. In this thesis, the framework was used to implement the IoT UI tab (section 4.3.3).

Chef

Chef [21] is a deployment automation tool. It automates the configuration, deployment, and management of infrastructure across the network. Chef is used by Logical Clocks for Hopsworks VM deployment. In this thesis, some of the Chef scripts were extended to support the IoT features.

2.5 Apache Kafka

Apache Kafka (Kafka) is a distributed streaming platform [5]. It has the capabilities of a messaging system, with clients interacting through publisher and subscriber APIs. Kafka brokers replicate the data, making the system fault-tolerant. Kafka scales horizontally, making it efficient at processing extremely large amounts of data. Kafka uses the concept of topics to store streams of records. Each topic defines an Access Control List (ACL) to manage access to its messages; each ACL contains a list of permissions describing which users are allowed to produce or consume records. Each record contains a key, a value, and a timestamp. Records are sent between clients and servers using a high-performance, binary TCP protocol.
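The topic/ACL relationship can be modeled with a small sketch (a toy model for illustration only, not Kafka's actual `AclBinding` API; class and method names are invented):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of Kafka-style topic ACLs: each topic keeps the set of principals
// allowed to produce to it. Revoking access (as done later for a misbehaving
// gateway) is simply removing the principal from that set.
final class TopicAcl {
    private final Map<String, Set<String>> allowedProducers = new HashMap<>();

    void allowProducer(String topic, String principal) {
        allowedProducers.computeIfAbsent(topic, t -> new HashSet<>()).add(principal);
    }

    void revokeProducer(String topic, String principal) {
        Set<String> producers = allowedProducers.get(topic);
        if (producers != null) producers.remove(principal);
    }

    boolean mayProduce(String topic, String principal) {
        return allowedProducers.getOrDefault(topic, Collections.emptySet())
                               .contains(principal);
    }
}
```

In the real cluster the broker enforces these checks on every produce request; the point here is only that exclusion of a client reduces to an ACL update.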

Apache Avro

Kafka can transmit data in any format. Nonetheless, it is crucial to keep the systems consistent and to follow the design choices made for the cluster. Logical Clocks uses the Apache Avro [22] format as its default Kafka data format. Avro is an open-source data serialization system. Data types are defined using JavaScript Object Notation (JSON) and later serialized to a compact binary format. Avro uses schemas to structure the encoded data. The schemas are defined by the user in JSON and consist of primitive and complex types [23].
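As an illustration, an Avro schema for a sensor measurement might look like the following (the record and field names are hypothetical, chosen for this example; they are not the schemas actually used in the thesis):

```json
{
  "type": "record",
  "name": "Measurement",
  "namespace": "io.hops.examples.iot",
  "fields": [
    {"name": "endpointClientName", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "value", "type": "double"}
  ]
}
```

A record conforming to this schema is then serialized to Avro's compact binary encoding before being produced to Kafka.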

2.6 Stream Processing

Stream processing is a programming paradigm that allows applications to process incoming events as they arrive in the system. It is the opposite of batch processing, which operates on previously stored datasets. Stream processing allows working on much smaller amounts of data and processing records in near real time. This technique has proven extremely useful in many real-life systems, such as tracking user activity on websites, processing financial transactions, and anomaly detection, to name a few.

Apache Spark

Apache Spark (Spark) is a distributed framework for general-purpose cluster computing and a unified analytics engine for Big Data and ML. At its core is a general execution engine called Spark Core, which provides interfaces for programming in several languages, such as Scala, Java, Python, and SQL, with implicit fault tolerance and data parallelism. Spark Streaming is a module built on top of the core. It is used to perform streaming analytics, extending the capabilities of Spark Core to applications that need to process data in real time. The strength of Spark Streaming is that it retains the key features of the core: data parallelism and fault tolerance. It also comes with integrations for some of the most popular data sources, including Kafka.

2.7 Security

JWT

A JSON Web Token (JWT) [24] is a standard for creating tokens that grant access to certain resources. The token contains a number of claims defining its properties, granted accesses, and other attributes. It is typically used in a stateless authentication mechanism: after a successful login, the user receives a JWT, with which it later requests access to certain resources. Logical Clocks uses JWT for authentication in the Hopsworks REST API.
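Structurally, a JWT is three Base64URL-encoded parts separated by dots: `header.payload.signature`. The sketch below (a minimal illustration using only the JDK, not a JWT library) decodes the header and payload to show where the claims live; only the signature binds them to the issuer:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Splits a JWT into its three parts and decodes the two JSON parts.
// For illustration only: a real consumer must also verify the signature
// before trusting any claim.
final class JwtParts {
    static String[] decode(String jwt) {
        String[] parts = jwt.split("\\.");
        Base64.Decoder d = Base64.getUrlDecoder();
        return new String[] {
            new String(d.decode(parts[0]), StandardCharsets.UTF_8), // header JSON
            new String(d.decode(parts[1]), StandardCharsets.UTF_8), // payload (claims) JSON
            parts[2]                                                // signature, left encoded
        };
    }
}
```

Because the header and payload are merely encoded, not encrypted, claims are readable by anyone holding the token; confidentiality must come from the transport (e.g. HTTPS).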

Public Key Infrastructure

Public Key Infrastructure (PKI) is a security system that enables the usage of digital certificates and the management of public-key encryption [25]. PKI helps to identify devices or services where authentication is needed. It is an essential element of the modern web, providing secure communication for billions of people, services, and devices. The Hopsworks VM sets up a self-signed certificate authority and issues certificates for each user. To push IoT data to the Hopsworks Kafka broker, the IoT Gateway needs to present its certificates, which are downloaded during the start-up process. More details can be found in section 4.2.3.

Chapter 3

Architecture

This chapter describes how the design challenges were solved in this project.

3.1 Components

Figure 3.1 presents the overall architecture of the project. It is similar to figure 2.1, but the Cloud Services are broken down in more detail to better present the flow of the data.

Figure 3.1: Project Architecture.

IoT Nodes

In this project's architecture, the IoT Nodes behave in the same way as in the typical IoT architecture described in section 2.1.

IoT Gateway

The IoT Gateway is the gateway for the IoT Nodes, providing a connection (proxy) between Hopsworks and the nodes. On one side, it runs a server that allows IoT Nodes to establish a connection and send measurements; the server collects the data from the devices. On the other side, the IoT Gateway connects to a Publisher/Subscriber (Pub/Sub) broker, constantly checks for new measurements, and publishes new data to the broker. The Pub/Sub broker is required to confirm the reception of the data, typically through Acknowledge (ACK) messages sent back to the gateway. The IoT Gateway receives them and can remove the ACKed data from its database. This way, at-least-once semantics are guaranteed. If the Pub/Sub broker did not confirm the incoming messages, the IoT Gateway would not be able to determine which messages were delivered and which were lost; in that configuration, we could expect only at-most-once semantics. This is not satisfactory, so the broker is required to send ACK messages back to the gateways. The IoT Gateway ensures data integrity across the whole pipeline, so it needs to be able to recover from unexpected failures and other errors. For that reason, each IoT Gateway runs a local database, which is used to immediately store measurements as they arrive at the gateway. They are then sent to a broker, and only once the broker confirms reception can a measurement be deleted from the local database. This also makes the gateways resilient to more sudden scenarios: for example, in case of an unplanned reboot, the IoT Gateway simply reads the content of the local database and re-sends the unACKed data.
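The store-send-delete-on-ACK cycle described above can be sketched as follows (an illustrative in-memory model, not the thesis' actual DatabaseService; the class `AckBuffer` and its method names are invented for this example, and the real gateway persists to disk via H2):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the gateway's at-least-once buffering: every measurement is
// written to the local store before sending, and removed only when the
// broker ACKs it. After a crash, whatever is still in the store is re-sent.
final class AckBuffer {
    private final Map<Long, String> pending = new LinkedHashMap<>(); // id -> measurement
    private long nextId = 0;

    long store(String measurement) {              // 1. persist before sending
        long id = nextId++;
        pending.put(id, measurement);
        return id;
    }

    Iterable<Map.Entry<Long, String>> unacked() { // 2. (re)send everything still pending
        return pending.entrySet();
    }

    void ack(long id) {                           // 3. broker confirmed: safe to delete
        pending.remove(id);
    }
}
```

Note that this scheme yields at-least-once, not exactly-once, delivery: if the gateway crashes after the broker receives a message but before the ACK is processed, the message is sent again on restart.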

Pub/Sub Broker

The Pub/Sub Broker is the point to which the IoT Gateways push the data. Hopsworks has an already running Apache Kafka cluster [5], which serves the role of a Pub/Sub broker. It is capable of receiving any kind of data (including IoT measurements) through its predefined topics. More about Kafka can be found in the documentation [5].

Stream Processing

Stream Processing is a module performing real-time analysis and anomaly detection on the IoT data. It reacts to anomalies in different ways: in case of a misbehaving device, it can instruct the IoT Gateway to ignore traffic from that device; in case of a misbehaving IoT Gateway, it can revoke the gateway's access to the Kafka cluster by removing it from Kafka's ACL. Stream processing jobs are also responsible for persisting the data in a database.
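One simple per-stream check such a job could run is a moving-average threshold (an illustrative sketch only, assuming a fixed window size and deviation threshold; it is not the detection logic actually implemented in the thesis' streaming jobs):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Flags a reading as anomalous when it deviates from the moving average of
// the last `size` readings by more than `threshold`. A streaming job could
// keep one such detector per device or per gateway.
final class MovingAverageDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;
    private final double threshold;
    private double sum = 0;

    MovingAverageDetector(int size, double threshold) {
        this.size = size;
        this.threshold = threshold;
    }

    /** Returns true if the value is anomalous with respect to the current window. */
    boolean offer(double value) {
        boolean anomalous = window.size() == size
                && Math.abs(value - sum / size) > threshold;
        window.addLast(value);
        sum += value;
        if (window.size() > size) sum -= window.removeFirst();
        return anomalous;
    }
}
```

On a flagged device the job would then trigger one of the mitigations above, such as telling the gateway to ignore that node's traffic.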

Storage

The incoming data has to be kept in safe and reliable storage. Hopsworks comes with HopsFS [3], a distributed file system, which is used in this project to store the data.

Visualization

The last step is the visual presentation of the data. Users need to be able to create graphs and charts from the measurements.

3.2 IoT Gateway in the Hopsworks Ecosystem

This section describes how the IoT Gateways and IoT Nodes are authenticated and managed from the Hopsworks platform. All gateways and nodes push data to Hopsworks under the same user, "IoT", identified by the email "[email protected]". The IoT feature has to be activated by the project administrator; in the background, this adds write permissions for the "IoT" user.

Figure 3.2: IoT registration procedure.

Each IoT Gateway has to be added through the "IoT" tab in Hopsworks. After the user enters its name, Internet Protocol (IP) address, and port, Hopsworks sends a JWT token to the gateway. The gateway is then able to download its PKI certificates and Avro schemas. Once that is done, the gateway starts producing messages to Kafka over a secure connection. The procedure, presented in figure 3.2, consists of nine steps:

1. User activates the IoT feature through the Hopsworks UI.

2. User registers a new IoT Gateway by providing its name, IP address, and port.

3. Hopsworks sends a JWT token to the IoT Gateway.

4. IoT Gateway authenticates itself with the JWT token and requests PKI certificates and password.

5. Hopsworks responds with PKI certificates and password.

6. IoT Gateway requests Avro schemas for LwM2M Kafka topics.

7. Hopsworks responds with the Avro schemas.

8. IoT Gateway streams IoT data to Hopsworks.

9. Hopsworks responds with ACK messages.

3.3 IoT Gateway Architecture

Each IoT Gateway needs multiple services:

• LeshanService - serves the role of LwM2M server. It accepts incoming connection requests from IoT Nodes, receives data, and stores it in a local database.

• DatabaseService - stores ready-to-send and unACKed IoT data.

• ProducerService - polls the local database and pushes any new data to the Hopsworks Kafka cluster.

• HopsworksService - exposes endpoints for communication with Hopsworks.

At a very early stage of design, it was decided to use the Scala programming language [14] for building the IoT Gateway. Scala is a modern, multi-paradigm, high-level language that allows writing both functional and object-oriented code. It runs on the JVM, which aligns with Logical Clocks' technology stack and provides a large ecosystem of libraries. It is popular among IoT and Big Data projects, making it a strong candidate for the IoT Gateway. The selection of Scala is important to mention, as the following design choices depend heavily on the environment it provides.

3.3.1 LeshanService

The LeshanService is responsible for setting up and maintaining an OMA LwM2M server, handling connections with IoT Nodes, receiving IoT data, and passing it to the next services in the pipeline. It needs to hold the state of the currently connected IoT Nodes, and its network parameters must be configurable. Eclipse Leshan [9] was selected as the OMA LwM2M server implementation: its JVM implementation, strong community, and popularity made it a good choice for the project.

3.3.2 DatabaseService

The main goal of the DatabaseService is to increase the reliability of measurement delivery. It is designed to preserve consistency in case of unexpected crashes of an IoT Gateway, Kafka broker outages, or other undesired situations. The DatabaseService acts as an on-disk buffer: it stores the measurements received from the IoT Nodes on disk and deletes them only once it receives an acknowledgment from a Kafka broker. The following paragraphs describe the design process, the selection of the tools, the design itself, and other possible solutions.

Embedded vs Standalone Database A database can run in an embedded mode [26] or as a standalone application. In embedded mode the database is hidden under the hood of the program: it is accessible only from within the running software, and the end-user is oblivious to its existence. In the case of a JVM program, the database is only accessible from that JVM. A standalone database usually runs as a separate process and is accessed over the network. It runs completely independently and can be accessed by multiple different clients. The database used for an IoT Gateway will have only one client - the gateway itself. To achieve high performance it has to run on the same machine as the gateway, and it only needs to be up as long as the gateway is up and running; there is no need to maintain it in case of a gateway crash. The ease of setting up and maintaining embedded databases, together with the arguments mentioned earlier, led to the selection of an embedded database for the project.

Database Type Selection There are many different types of databases available on the market. In the context of IoT the three types that are widely used are:

• Time-series databases

• Key-value stores

• Relational databases

A Time-Series Database (TSDB) [27] is a database optimized for time series or time-stamped data. It usually comes with a set of tools supporting operations on time series data - creation, update, deletion, and enumeration. It also handles data lifecycle management and large range scans in a way that is specific to this kind of data. Typical examples of stored records are sensor data, server metrics, stock prices, and network traces. TSDBs are designed to work on large data sets and have built-in algorithms for large scans. These are not features required by the DatabaseService: the service will neither perform large scans nor work on large data sets.

A key-value store [28] is a type of database that stores data as simple key-value pairs; other names for the structure are dictionary and hash table. It associates a unique key with a value, and both can be of any type - from primitive types like int or long to complex nested and compound objects. Key-value stores are usually optimized for distributed setups and horizontal scaling, making them a popular choice for cloud-based applications. Overall, these characteristics do not make them a good choice for the DatabaseService use case.

A relational database [29] is the best-known way of storing information. It stores data in tables organized in rows and columns, with each row identified by a unique primary key. Relational databases use SQL for data operations and maintenance. They are considered a mature technology that has been optimized for decades and adapted to changing needs. They are fast, robust, and have a small memory footprint in embedded mode, and there is a large selection of them for the JVM environment. This makes a relational database a solid candidate for the DatabaseService.

JVM SQL Database Selection The three most commonly used and well-supported JVM SQL databases are H2 [16], Apache Derby [30], and HSQLDB [31]. The selection was based on the features table in [16] and the performance tests in [32] that fit the use case of the DatabaseService ("single connection benchmark, run on one computer, with many very simple operations running against the database" [32]). The first source shows the rich collection of features that H2 offers, and the second shows its very good performance compared to the others. H2 was therefore selected as the database for the project. H2 can be used either directly through the JDBC interface or through a third-party framework that performs the JDBC operations under the hood. The first method quickly becomes tedious and is awkward to use from Scala, so it was decided to use a third-party framework. A framework that targets Scala and supports H2 is Slick [17]. It provides all the needed features - type safety, a Scala API, and composability - and is maintained by a reliable party, Lightbend. It was therefore decided to use Slick to interact with the underlying H2 instance.

Design The database must store two types of data - blocked EndpointClientNames and buffered IoT measurements. The table BLOCKED_ENDPOINTS stores the EndpointClientNames of devices that are meant to be blocked. It consists of only one field Endpoint, which is also its primary key. The design choices for tables representing the measurements were strictly motivated by OMA LwM2M specification [33]. The top-level table, MEASUREMENTS, stores data required to identify a specific measurement. Its primary key is composed of three fields - TIMESTAMP, ENDPOINT_CLIENT_NAME, and INSTANCE_ID (a single device with unique EndpointClientName can produce multiple instances of an Internet Protocol for Smart Objects (IPSO) object with the same timestamp, thus a need for a composite primary key). Additionally, every IPSO object has its equivalent table with additional MEASUREMENT_ID column that refers to a record in the MEASUREMENTS table. An example of the temperature object 3303 table and the table for blocked devices can be seen in figure 3.3.
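A minimal SQL sketch of the tables described above, assuming H2 syntax; the column types and the exact shape of the MEASUREMENT_ID reference are illustrative guesses, not copied from the thesis code:

```sql
-- Hypothetical DDL matching the described design.
CREATE TABLE BLOCKED_ENDPOINTS (
  ENDPOINT VARCHAR(255) PRIMARY KEY
);

CREATE TABLE MEASUREMENTS (
  ID                   BIGINT AUTO_INCREMENT UNIQUE,
  TIMESTAMP            BIGINT       NOT NULL,
  ENDPOINT_CLIENT_NAME VARCHAR(255) NOT NULL,
  INSTANCE_ID          INT          NOT NULL,
  -- composite key: one device may report several object instances
  -- with the same timestamp
  PRIMARY KEY (TIMESTAMP, ENDPOINT_CLIENT_NAME, INSTANCE_ID)
);

-- One table per IPSO object, e.g. temperature (object 3303)
CREATE TABLE TEMPERATURE_3303 (
  MEASUREMENT_ID BIGINT NOT NULL,  -- refers to a record in MEASUREMENTS
  SENSOR_VALUE   DOUBLE
);
```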

Figure 3.3: Example of tables generated for DatabaseService.

Local Kafka Broker Instead of a Database An alternative to the database could be a local Kafka broker. The broker would collect the measurements and replicate them to the Hopsworks Kafka broker. From the implementation point of view, this concept could be simpler than the database: Kafka is very easy to set up and maintain in a non-distributed environment. Nonetheless, this design introduces new problems - maintenance of additional Kafka brokers and extension of the Hopsworks security domain to new machines - and requires many more resources to run. At this stage, it does not make a good design candidate for the IoT Gateway.

3.3.3 ProducerService

The ProducerService is responsible for the successful delivery of measurements to the Hopsworks Kafka cluster. It takes care of periodically polling the database, formatting the data, pushing it to Kafka, and cleaning up the database.

Key Selection An important aspect of the architecture is the right selection of the Kafka record key. According to [34], the key serves two goals: it is a piece of additional information stored along with the message, and it determines the partition that the message will be written to. In Kafka, the order of messages is preserved only within a partition, and the selection of the key has to take this into account. Possible choices for the Kafka key are:

• IoT Gateway Name - selecting the gateway name would preserve the order of the measurements, making it a fairly good key. The drawback of this choice shows when traffic comes from only a few gateways: it would lead to unbalanced traffic, putting a lot of load on some Kafka partitions while leaving others idle.

• Node’s EndpointClientName - a good choice for a key, as the order of measurements per device is still preserved and, because of the huge number of devices, the traffic will be spread evenly across partitions.

• Node’s EndpointClientName and timestamp - adding a timestamp to the key would break the ordering, as the measurements of the same device would be sent to multiple partitions. This choice does not add any value.

After weighing the pros and cons of all the possible keys, the Node’s EndpointClientName is the best candidate.
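The ordering argument can be illustrated with a toy partitioner. Note that Kafka's real default partitioner hashes the serialized key with murmur2; the hashCode below is only a stand-in for illustration.

```java
public class KeyPartitioning {
    // Illustrative only: maps a record key to one of numPartitions partitions,
    // the way any hash-based partitioner does.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        String endpoint = "urn:uuid:3f2c0a51-...";  // a Node's EndpointClientName
        int p1 = partitionFor(endpoint, 6);
        int p2 = partitionFor(endpoint, 6);
        // The same key always maps to the same partition, so per-device
        // ordering is preserved; different devices spread across partitions.
        System.out.println(p1 == p2);  // prints true
    }
}
```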

Schema Selection Another design part to consider was how to split the data into Kafka Avro schemas. Solutions that were taken into consideration:

• Avro schema per IoT Gateway

– Benefits - devices spread out evenly across gateways generate traffic that is spread evenly across topics

– Drawbacks - a need to create one Avro schema for all types of LwM2M messages

• Avro schema per LwM2M IPSO object

– Benefits - each LwM2M message has a predefined schema and has a corresponding Kafka Avro schema (“schema follows schema”)

– Drawbacks - no way of redistributing traffic. If one type of sensor is used much more heavily than others, one topic will always be under heavy load while others are not. This is more of an engineering problem that can be tackled by introducing new brokers, more partitions, or other solutions that are out of the scope of this thesis.

After considering all the options, it was decided to take the second approach - an Avro schema per IPSO object. This approach feels natural, as all the Avro schemas follow the OMA LwM2M specification. It will be easy to track changes in the LwM2M protocol and upgrade the system to its future versions.
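As an illustration of the "schema follows schema" idea, a hypothetical Avro record schema for the IPSO temperature object 3303 might look like this; the field names are assumptions, not the project's actual schema:

```json
{
  "type": "record",
  "name": "Temperature",
  "namespace": "lwm2m.ipso3303",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "endpointClientName", "type": "string"},
    {"name": "instanceId", "type": "int"},
    {"name": "sensorValue", "type": "double"}
  ]
}
```

Each IPSO object would get its own such record, registered under its own topic (e.g. topic-lwm2m-3303-temperature).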

3.3.4 HopsworksService

HopsworksService is responsible for all communication between the IoT Gateway and Hopsworks. It exposes an API on localhost:12345/gateway/ (the context root is gateway). The REST API is listed in table 3.1.

Method   Endpoint              Description
GET      /                     Info about the gateway itself
GET      /nodes                List of currently connected IoT Nodes
POST     /nodes/{id}/blocked   Start blocking the node with endpoint id
DELETE   /nodes/{id}/blocked   Stop blocking the node with endpoint id
POST     /jwt                  Post a new JWT token to connect to Hopsworks

Table 3.1: HopsworksService REST API.

3.4 Hopsworks Architecture

This section describes how the new components, IoT Gateway and IoT Node, are represented in Hopsworks and the changes made to Hopsworks in the frame of this thesis.

3.4.1 Hopsworks Database

To make the use of IoT Gateways simple, it was decided to make an IoT Gateway a subresource of a project. A gateway is registered in a project and produces data that is accessible only within the project it has been registered in. This strict approach follows Hopsworks design concepts and provides full security. It also has the clear limitation that a gateway cannot be used in multiple projects. It was decided to follow this approach nonetheless, as a more flexible design would introduce many architectural complexities that are out of the scope of the topic and do not bring significant value to the thesis. A user can register multiple gateways in a project, so there is a one-to-many relation between projects and gateways. A gateway can be in one of several states - UNREGISTERED, ACTIVE, BLOCKED, INACTIVE_BLOCKED. A state diagram can be seen in figure 3.4.

Figure 3.4: IoT Gateway state in Hopsworks.

A new gateways table was introduced in the Hopsworks database. The table contains all the information necessary for Hopsworks, such as id, IP address, port, etc. The table's schema can be seen in figure 3.5.

Figure 3.5: New gateways table in Hopsworks database.

3.4.2 User Interface

The users interact with the IoT Gateways through a UI. It provides information about registered IoT Gateways and IoT Nodes. Users can see which nodes are connected to which gateways, register a new gateway, and block/unblock traffic from selected gateways.

3.4.3 IotGatewayResource

IotGatewayResource is responsible for interacting with all the IoT Gateways and anomaly detection services, and for exposing the data to the Hopsworks UI. To make the data accessible it exposes an API, which can be found in table 3.2.

Method   Endpoint                                         Description
POST     /project/{id}/gateways/activateIot               Activate IoT functionality for the project
GET      /project/{id}/gateways                           List of all IoT Gateways
PUT      /project/{id}/gateways                           Register a new gateway
DELETE   /project/{id}/gateways/{id}                      Unregister a gateway
GET      /project/{id}/gateways/{id}                      Get info about an IoT Gateway
GET      /project/{id}/gateways/{id}/nodes                Get list of all IoT Nodes of an IoT Gateway
GET      /project/{id}/gateways/{id}/nodes/{id}           Get info about a specific IoT Node
POST     /project/{id}/gateways/{id}/blocked              Start blocking an IoT Gateway
DELETE   /project/{id}/gateways/{id}/blocked              Stop blocking an IoT Gateway
POST     /project/{id}/gateways/{id}/nodes/{id}/blocked   Start blocking an IoT Node
DELETE   /project/{id}/gateways/{id}/nodes/{id}/blocked   Stop blocking an IoT Node

Table 3.2: IotGatewayResource REST API.

3.4.4 Data Storage

The incoming IoT data has to be persisted. The file system used in Hopsworks, HopsFS [3], is a drop-in replacement for Apache Hadoop's HDFS. It is integrated into Hopsworks and will be used as the place for storing the measurements. The last actor that interacts with the data, the Spark streaming jobs, will be responsible for writing the data into HopsFS.

3.4.5 Streaming Jobs

The incoming IoT data has to be processed in real-time to enable anomaly detection and prevent attacks, such as a DDoS attack. Anomaly detection can, however, be very tricky and complex, and exceeds the scope of the thesis. This project aims at providing support for future work on IoT anomaly detection that can be done by a team of data scientists. The streaming jobs are also the last step before storing the data in HopsFS. The requirements that streaming jobs have to meet are as follows:

• Adapt existing API of hops-util [35] to manage IoT Gateways and IoT Nodes

• Provide an example of anomaly detection based on the number of events

• Store data in HopsFS

3.5 IoT Nodes

An IoT Node is an IoT device that collects measurements and pushes them to the cloud. It is typically a device with limited computing power, memory, and energy resources. This creates a need for very lightweight protocols, specially adapted to the IoT world. During the design process multiple protocols were considered, such as MQTT, Zigbee, and others. It was finally decided to use the OMA LwM2M protocol [33] because of its robustness and increasing popularity. It is a light, modern protocol with a growing number of users in the industry, designed with the difficulties faced by the smallest devices in production use in mind. The following subsections describe how the devices are identified and managed by the IoT Gateway.

3.5.1 Endpoint Client Name

According to the OMA LwM2M specification [33], the client identifies itself with a permanent Endpoint Client Name. It is constructed by the client using guidelines that follow the Uniform Resource Name (URN) standard [36]. There are multiple URN formats, each based on a different property - Universally Unique Identifier (UUID), International Mobile Equipment Identity (IMEI), Media Access Control (MAC) address, and others. In the context of the project it is not known in advance what kind of devices will connect to the IoT Gateway, and it cannot be predicted what sort of interfaces they will be equipped with. It was therefore decided that the devices will build their Endpoint Client Name using the UUID URN - a format recommended by the OMA LwM2M specification. A UUID is a 128-bit number used for information identification in computer systems; it is an open standard introduced by the Open Software Foundation. There is a non-zero probability of two UUIDs colliding, but the probability is so small that it can be safely ignored.
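A device following this scheme could generate its name as below. This is a generic JDK sketch, not the Contiki-NG client's actual code:

```java
import java.util.UUID;

public class EndpointName {
    // Builds a UUID URN as in RFC 4122, e.g. "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6".
    static String newEndpointClientName() {
        return "urn:uuid:" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(newEndpointClientName());
    }
}
```

In practice the UUID would be generated once and persisted on the device, since the Endpoint Client Name must be permanent across reboots.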

3.5.2 Measurements Timestamping

Most common use cases of IoT data involve analysis and visualization in the time domain. It is thus very important to timestamp the measurements correctly. The OMA LwM2M standard does not include timestamping of the measurements: the values from the sensors are sent to the server immediately after they are read, without any buffering. The LwM2M protocol relies on CoAP and its mechanisms for retransmission and congestion control. According to section 8.3 of [33], in case of network connectivity issues CoAP will attempt to retransmit a message a few times. If the message cannot be delivered, the information should be passed to the user application, but this is out of the scope of the LwM2M specification. This mechanism can introduce timestamping delays in the range of tens of seconds, or even lost measurements. However, the protocol is not intended for time-critical systems or systems that require extremely high reliability, so these drawbacks are acceptable in the scope of the project. Another mechanism used by CoAP is congestion control, which limits the number of simultaneous outstanding interactions to one. The design benefits from it, as the measurements will always be delivered in the correct order. In summary, it is sufficient to timestamp the measurements with their time of arrival at the gateway. The attached timestamp is later passed to Hopsworks along with the measurement data.
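The arrival-time timestamping can be sketched as follows; the Measurement record and its field names are illustrative, not the gateway's actual classes:

```java
import java.time.Instant;

public class ArrivalTimestamping {
    // Minimal sketch: the gateway attaches its own clock reading on arrival,
    // since LwM2M messages carry no timestamp of their own.
    record Measurement(String endpoint, double value, long arrivalMillis) {}

    static Measurement onArrival(String endpoint, double value) {
        return new Measurement(endpoint, value, Instant.now().toEpochMilli());
    }

    public static void main(String[] args) {
        Measurement m = onArrival("urn:uuid:...", 21.5);
        System.out.println(m);
    }
}
```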

3.5.3 Measurement Life Cycle

Each measurement travels through all components of the pipeline. Figure 3.6 presents the services through which the measurements pass.

Figure 3.6: Measurement life cycle.

A sensor, part of an IoT Node, takes a measurement and sends it over the network to an IoT Gateway. It is received by the LeshanService and immediately stored in the database by the DatabaseService. The database is polled by the ProducerService for new records, which are sent over the network to a Kafka broker running as part of Hopsworks. The measurements are analyzed in real time by the Spark jobs and finally stored in HopsFS.

3.6 Security

This section describes how security is ensured throughout the whole pipeline. Figure 3.7 shows the system security design. There are two security domains.

Figure 3.7: System Security Architecture.

The first domain covers the IoT Nodes connecting to the IoT Gateway. At the time of writing the thesis, Eclipse Leshan supported four security models - no security, Pre-Shared Key (PSK) [37], RPK [38], and X.509 [39]. The no-security model was discarded immediately. After initial discussions, it was decided not to use X.509 either. It is the best and most advanced security model to date, but it would require building a whole PKI, which is out of the scope of this thesis. To decide between PSK and RPK, the consequences of key exposure were compared. With RPK the server only has to store the clients' public keys, while with PSK it has to store the shared secret keys. If an attacker breaks into a server holding RPK keys, they can decrypt the communication; if the server uses PSK, the attacker can not only do the same but also impersonate every client whose key is stored on the server. It was thus decided to use RPK for the first security domain. All the data is encrypted and transported over Datagram Transport Layer Security (DTLS).

The second security domain contains the IoT Gateway and Hopsworks. During the registration process, the gateway downloads its PKI certificates, which are later used for authentication. It also uses these certificates when pushing IoT data to the Kafka broker. This security model is sufficient if two requirements are fulfilled - all of the intermediaries are trusted, and the performance degradation caused by the IoT Gateway decrypting and re-encrypting the data does not impact the system significantly. Both are fulfilled. All the IoT Gateways register with Hopsworks and obtain their PKI certificates; based on these they are considered trusted. Next, the data flow does not impact the performance of the system and does not create any bottlenecks. The solution is horizontally scalable and, in case of too many devices, new IoT Gateways can easily be added. To prevent attacks and to avoid feeding the models with possibly corrupted data, the system can decide to revoke access for a single Node or a whole IoT Gateway. Blocking a Node is done by the IoT Gateway, instructed by the Hopsworks services. Blocking an IoT Gateway is done internally within Hopsworks by removing its credentials from Kafka's ACLs. To cut off all gateways, the system can remove the user "IoT" from the project, effectively discarding all IoT traffic.

Chapter 4

Implementation

This section describes the implementation of each element of the architecture. It starts with software and hardware selected for the IoT Nodes. Later, it goes through the implementation of each service of the IoT Gateway. Lastly, it details the components implemented for Hopsworks.

4.1 IoT Nodes

For running simulations and experiments, two implementations were used. First, a simulator was set up using an example [9] from Eclipse Leshan. It is a simple Java program run through the command line interface from a JAR file available on the Leshan website. The command to run the simulator is: java -jar ./leshan-client-demo.jar

The simulator is capable of connecting to an IoT Gateway and sending a sample temperature measurement every two seconds. Second, a real IoT device was used to validate the pipeline. The Thunderboard Sense 2 IoT Development Kit [10] was selected to run the experiments. It is an IoT development platform that provides a range of different sensors and is easy to develop for. The device was running the Contiki-NG operating system [12]. Contiki-NG provides an OMA LwM2M example application capable of connecting to a server and sending measurements from the temperature sensor [13]; this application was used during the evaluation process. The Sense 2 board uses IPv6 over Low-Power Wireless Personal Area Networks (6LoWPAN) to connect to the gateway. To enable the radio connection between the IoT Gateway and the Sense 2 board, a Zolertia Firefly board was used [11]. Both devices can be seen in figure 4.1.

Figure 4.1: Zolertia Firefly (top) and Thunderboard Sense 2 (bottom).

4.2 IoT Gateway

The IoT Gateway was built using Akka [40], a toolkit for highly concurrent, message-driven applications on the JVM and an implementation of the Actor Model [15]. Following the model, a separate actor was created for each of the services. Each actor uses helper classes that are described in the subsections below.

4.2.1 LeshanService

The LeshanService is implemented in the package com.logicalclocks.iot.leshan in [41]. LeshanActor is the main class; it extends Akka's Actor class, is the main actor in this service, and manages all of the other objects. It holds a reference to a server object, an instance of the leshan.HopsLeshanServer class, which is an implementation of an Eclipse Leshan server. The default address for accessing the user interface of the Leshan server is localhost:8082. LeshanActor uses a few other helper classes. device.IotDevice is used to maintain the list of devices connected to the server. The package listeners defines listeners that subscribe to LwM2M messages in Eclipse Leshan. Currently, the subscribed messages are registrations (through HopsRegistrationListener) and observations (HopsObservationListener). Depending on the use case, more listeners can be added.

4.2.2 DatabaseService

All of the classes for DatabaseService are located in package com.logicalclocks.iot.db in [41]. The main class that represents the Akka Actor is DatabaseServiceActor. It uses a state machine with a list of actions to perform after each received message. The transitions and taken actions are listed in DatabaseServiceSm. Further, all interactions with the underlying database were extracted into an actor DbOutputsActor. It holds a reference to H2DatabaseController which is the only place to query and maintain the database. All the messages sent to DatabaseServiceActor, as well as between DatabaseServiceActor and DbOutputsActor, are listed in DomainDb. The last trait in this package, HopsDbController, contains signatures of all functions needed to perform required operations on a database. The trait makes heavy usage of asynchronous programming. All operations are designed to return either a Future or an OptionT, depending on the nature of the query. The trait defines database management operations like opening and closing a database, queries such as operations on a single record or a batch of records, and maintaining the table of blocked endpoints.

Slick The package com.logicalclocks.iot.db.slick in [41] contains the implementation of the service specific to the Slick framework. The object DbTables contains the definitions of all the tables and the relations between them.

The class H2DatabaseController is an extension of HopsDbController and implements all operations using the Slick framework. There is no in-memory state of blocked IoT Nodes; the only place that keeps track of them is the database table BLOCKED_ENDPOINTS. The blocked devices have to be stored in the database anyway, so introducing an in-memory state would only complicate maintenance.

4.2.3 ProducerService

The ProducerService implementation can be found in the com.logicalclocks.iot.kafka package in [41]. The main logic is implemented in ProducerServiceActor. It holds state for the Hopsworks PKI certificates, the Kafka topics' Avro schemas, a custom implementation of a Kafka producer, and a reference to a Cancellable holding the task scheduled to poll the database. The ProducerService starts as an idle process awaiting messages from other actors. It needs the PKI certificates for a secure connection to the Kafka cluster, and the Avro schemas to be able to produce messages to the Kafka topics. Once these are received, the ProducerService starts polling the database for new measurements. They are sent to Kafka, and a callback is invoked once Kafka confirms reception; that way at-least-once semantics are provided. The certificates received from Hopsworks are encoded; decoding and saving them to a file is done by HopsFileWriter.
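The delete-only-after-acknowledgment behavior can be sketched with a toy in-memory buffer standing in for the H2 database; the class and method names here are hypothetical, not the gateway's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AtLeastOnceBuffer {
    // Records are deleted only after the broker acknowledges them,
    // so an unacknowledged record stays in the buffer and is re-sent.
    private final Map<Long, String> pending = new LinkedHashMap<>();
    private long nextId = 0;

    long store(String measurement) {   // DatabaseService: persist on arrival
        pending.put(nextId, measurement);
        return nextId++;
    }

    List<String> poll() {              // ProducerService: fetch everything unacked
        return new ArrayList<>(pending.values());
    }

    void onAck(long id) {              // Kafka producer callback: safe to delete
        pending.remove(id);
    }

    int unacked() { return pending.size(); }

    public static void main(String[] args) {
        AtLeastOnceBuffer buf = new AtLeastOnceBuffer();
        long id = buf.store("temp=21.5");
        System.out.println(buf.unacked());  // prints 1: not yet acknowledged
        buf.onAck(id);
        System.out.println(buf.unacked());  // prints 0: acked and deleted
    }
}
```

If the gateway crashes between store and onAck, the record survives (on disk, in the real H2 buffer) and is sent again on restart, which is exactly what at-least-once delivery means: duplicates are possible, losses are not.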

4.2.4 HopsworksService

The code for the HopsworksService can be found in the com.logicalclocks.iot.hopsworks package in [41]. The class serving the role of the main actor is called HopsworksServiceActor. It keeps references to instances of two classes - HopsworksClient and HopsworksServer. The first is responsible for HTTP requests to Hopsworks and defines two methods - one for downloading the PKI certificates for the project and another for downloading the Avro schemas of the Kafka topics. The latter extends HopsworksService and exposes the gateway REST API, handling all incoming HTTP traffic. Both classes are implemented using the Akka HTTP framework [40], which is part of the Akka ecosystem and works well with the Akka Actor framework already in use.

4.3 Hopsworks

This section describes the implementation of the new Hopsworks services and the changes to existing ones that were introduced to serve the IoT traffic. It also goes into the details of the custom UI and the streaming jobs used in this project.

4.3.1 Hopsworks Database

A new table gateways was introduced to the Hopsworks database. The design of the table can be seen in figure 3.5. The table schema was included in hopsworks-chef bootup files in [42].

4.3.2 IoTGatewayResource

The IotGatewayResource service handles every IoT API request. The implementation can be found in the packages io.hops.hopsworks.api.iot and io.hops.hopsworks.common.dao.iot in [43]. It follows the design patterns and extends the models used by Logical Clocks. The service uses IotGatewayController to perform all actions and IotGatewayFacade to perform operations on the database; both are defined as part of the model in IotGateways. In some cases the service works as a proxy, passing requests on to the appropriate IoT Gateway. It uses the Apache HTTP library [44] to communicate with the gateways. A sample sequence diagram for getting the list of IoT Nodes can be seen in figure 4.2.

Figure 4.2: Sequence diagram of a REST call getting the list of IoT Nodes.

The process starts with IotGatewayResource calling getNodesOfGateway on IotGatewayController. The controller fetches needed data from the database using IotGatewayFacade and sends an HTTP request to the gateway. After receiving a response it builds IotDeviceDTO object with the help of IotGatewayBuilder. Eventually, this object is returned to the user.

4.3.3 User Interface

The codebase for the UI added to support IoT Gateways is an extension of hopsworks-web in [43]. The UI was implemented using the AngularJS [20] framework. Users access it through the Kafka section, IoT tab. Screenshots showing the capabilities of the UI are shown in figures 4.3, 4.4, 4.5, and 4.6. Figure 4.3 shows the window for entering the details of a new IoT Gateway. Figure 4.4 presents the overview of the IoT tab, with the IoT Gateways listed and buttons to expand the view. Figure 4.5 shows the window presenting the details of an IoT Gateway. Finally, figure 4.6 displays the window with the list of currently connected IoT Nodes.

Figure 4.3: UI - Enter IoT Gateway Details window.

Figure 4.4: UI - Overview of IoT tab.

Figure 4.5: UI - IoT Gateway Details window.

Figure 4.6: UI - IoT Nodes window.

4.3.4 Streaming Jobs

All of the streaming jobs were developed as Apache Spark [45] jobs in Hopsworks. Apache Spark is an analytics engine for large-scale data processing. It provides support for real-time stream processing and is a popular choice for anomaly detection. The next paragraphs describe the process of writing the jobs.

hops-util changes The hops-util library required support for managing IoT Gateways and IoT Nodes. The operations that the jobs have to be able to perform are blocking and unblocking an IoT Gateway, and blocking and unblocking a single IoT Node. The implemented functions are blockIotGateway and blockIotNode and can be found in an updated version of hops-util in [46]. They perform HTTP requests to the Hopsworks REST API described in section 3.4.3. The documentation of the functions can be found in the source code.

Anomaly Detection Anomaly detection for IoT is a broad topic in itself. This project provides a generic framework for future implementations, supplying the tools necessary to react properly after detecting an anomaly (described in the previous paragraph). A sample anomaly detection job can be found in com.logicalclocks.iot.spark.TrafficDetectionOneTopic.scala in [47]. The job counts the number of events per Kafka topic from each IoT Gateway in a time window. Once a predefined threshold is reached, traffic from the offending gateway is blocked.

Storing IoT data in HopsFS The job responsible for storing data in HopsFS is called StoreIotDataInHopsFs and can be found in [47] in the package com.logicalclocks.iot.spark. The job subscribes to all LwM2M topics and stores the data in Parquet format [48] in HopsFS. It does not need to handle misbehaving devices or other forms of attack, as the other jobs already block malicious traffic from such gateways and nodes.

4.4 Installation

This section explains how to set up the environment needed to run an IoT pipeline. The setup has two main components: Hopsworks and the IoT Gateway. One should start with Hopsworks and then proceed with the IoT Gateway.

Hopsworks installation Before installation, users should be familiar with the Hopsworks environment and the regular installation process described in [49]. To run the Hopsworks virtual machine, users should clone the release-iot-thesis branch of [50]. The run command is: ./run.sh ubuntu 1 hopsworks

The virtual machine needs to run on a publicly accessible network (with a public IP address) so that it is reachable by the gateways. The environment automatically pulls custom versions of hopsworks-chef, hopsworks, and hops-util. The next step is to create Kafka schemas and topics for OMA LwM2M messages. The Avro schemas can be found in the avro directory in [41]. The list of topics can be found in src/main/scala/com/logicalclocks/iot/kafka/LwM2mTopics.scala in [41]. The topic names should follow the name values of that file, e.g. topic-lwm2m-3303-temperature. After creating the topics, one should go to the IoT tab and activate the IoT feature. Lastly, the Spark jobs should be started: one to save measurements to HopsFS and one, optional, for anomaly detection. The code for the jobs can be found in package com.logicalclocks.iot.spark in [47]. Once the jobs are started, Hopsworks is ready to receive IoT data.
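The topic-naming convention mentioned above can be sketched as a mapping from IPSO object IDs to topic names. This is a hypothetical fragment for illustration; the authoritative list lives in LwM2mTopics.scala in [41].

```scala
// Hypothetical fragment of the IPSO-object-to-Kafka-topic mapping;
// the real list lives in LwM2mTopics.scala in [41].
object LwM2mTopicNames {
  // Only the two objects used in this thesis (see section 6.2).
  private val ipsoObjects = Map(
    3302 -> "presence",
    3303 -> "temperature"
  )

  // e.g. topicName(3303) yields Some("topic-lwm2m-3303-temperature")
  def topicName(objectId: Int): Option[String] =
    ipsoObjects.get(objectId).map(n => s"topic-lwm2m-$objectId-$n")
}
```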

IoT Gateway The IoT Gateway code is available in the branch release-0.1 in [41]. The network configuration file can be found in src/main/resources/application.conf. The gateway section sets the REST API endpoint. The hopsworks section specifies where the gateway can find Hopsworks. The leshan section sets the sockets for CoAP and DTLS-Secured Constrained Application Protocol (CoAPS) communication, as well as the Leshan UI address. The kafka section holds all the Kafka producer settings. The gateway is ready to start out of the box. The command to run it is: java -Dconfig.file=application.conf -jar Hops IoT Gateway-assembly-0.1.0-SNAPSHOT.jar
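The four configuration sections described above could look roughly as follows. This is a sketch of the HOCON structure only; the key names inside each section are assumptions inferred from the description, not the exact contents of application.conf in [41].

```
gateway {
  rest-port = 12222            # REST API endpoint of the gateway (illustrative)
}

hopsworks {
  hostname = "hopsworks.example.com"  # where the gateway finds Hopsworks
  port = 443
}

leshan {
  coap-port = 5683             # CoAP socket
  coaps-port = 5684            # CoAPS (DTLS-secured) socket
  ui-port = 8080               # Leshan UI address
}

kafka {
  # Kafka producer settings
  bootstrap-servers = "hopsworks.example.com:9091"
}
```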

The new gateway has to be added to Hopsworks through the UI. Once added, the gateway is ready to accept any OMA LwM2M device and push its data to Hopsworks.

Chapter 5

Evaluation

The evaluation of the project was performed in three steps: verification, validation, and benchmarking. The following subsections describe each step in detail.

5.1 Verification

The progress of the project was verified at least weekly by the company's supervisor. Each verification included making sure the software satisfied the requirements specified by Logical Clocks, code reviews, and, if needed, a short demonstration of the implemented functionality. If a feature required changes, it was discussed again at subsequent meetings until it was approved. Based on this process, an internal design document was prepared, submitted to, and approved by Logical Clocks. The code was verified with unit tests. The unit test coverage of the IoT Gateway [41] was 45%, with the tests concentrated on the most important features.

5.2 Validation

After the verification and the completion of the software development process, the software was validated to check whether it satisfied all the specified requirements. Below is the list of all validation steps and their results.

5.2.1 Test setup

For running the tests, two machines were used: a remote machine called bbc2 and a local machine called computer. The specifications of the machines can be found in tables 5.1 and 5.2. Versions of the software used for all the tests can be found in table 5.3. Instructions for setting up Hopsworks and the IoT Gateway can be found in section 4.4.

Operating System: CentOS Linux release 7.3.1611
CPU Model: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores: 32
RAM: 252 GB

Table 5.1: bbc2 test machine specifications.

Operating System: Ubuntu 18.10 Cosmic Cuttlefish
CPU Model: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
CPU cores: 4
RAM: 15 GB

Table 5.2: computer test machine specifications.

hopsworks-iot: iot-thesis
karamel-chef: iot-thesis
hopsworks-chef: iot-thesis
hopsworks-iot-streaming-jobs: master
hops-util: iot-thesis

Table 5.3: Software branches used for tests.

5.2.2 Test with an IoT simulator

This test validated whether the whole pipeline works as expected in the simplest possible setup. The test is passed if the data generated by the simulator can be visualized in Hopsworks, proving that the data passes through every element of the pipeline and is safely stored in HopsFS. The setup included Hopsworks (bbc2), the IoT Gateway (computer), and an IoT Node simulator (computer).

Figure 5.1: Screenshots of running IoT Gateway (top) and IoT Node simulator (bottom).

Screenshots and results from the test can be seen in figures 5.1, 5.2, and 5.3. Figure 5.1 shows a screenshot of the running setup: the top shows the console output of a running IoT Gateway, the bottom the console of a running IoT Node simulator connected to the gateway.

Figure 5.2: Screenshot of running Eclipse Leshan server.

Figure 5.2 is a screenshot of the running Leshan server, a part of the IoT Gateway. It can be seen that the gateway is observing temperature measurements from the device.

The last part of the pipeline, storing data in HopsFS, was handled by the Spark Streaming job StoreIotDataInHopsFs. A detailed description of the job can be found in section 4.3.4.

Figure 5.3: IoT simulator data retrieved from HopsFS.

The last figure in this section, figure 5.3, shows a graph of the simulator data. The graph was built with a script that can be found in [47] in notebooks/test-1-data-visualization.ipynb. Figure 5.3 shows that the pipeline is working correctly and that the test has passed: it is possible to receive data from an IoT simulator, store it, and process it using the platform.

5.2.3 Test with a real IoT device

A test setup was prepared using the IoT devices listed in section 4.1. A Thunderboard Sense 2 board was used as the IoT device. It was running the Contiki-NG operating system with an OMA LwM2M demo application [13]. The IoT Gateway was set up on the computer machine with a Zolertia Firefly connected to access the wireless network. Hopsworks was run on the bbc2 machine. The IoT device was able to successfully connect to the IoT Gateway and push its measurements to Hopsworks. The data was stored in HopsFS and easily accessed for future analysis.

This test proved the feasibility of the project in a real-life environment, as well as compatibility with the OMA LwM2M protocol across different implementations.

5.2.4 Multiple gateways test

This test validates whether an IoT Node can upload measurements to Hopsworks through different gateways. It also checks whether Hopsworks correctly recognizes the node despite it changing gateways. The setup of the test included the Hopsworks virtual machine running on the bbc2 machine, and two IoT Gateways and an IoT Node running on the computer machine. The two gateways shared the same machine but had different IDs (11000 and 11001) and ran on different ports. The test was run by starting the Hopsworks VM and connecting the gateways. After making sure that the gateways had started their Kafka producers, the IoT Node connected to the first gateway, started producing measurements, shut down, connected to the second gateway, and resumed sending data. The data was stored in HopsFS and analyzed with the notebook notebooks/Two-Gateways-Test.ipynb from [47]. All the records were correctly recognized. Regardless of the gateway ID, they were stored in the same HopsFS directory (which is based on endpointClientName).

5.2.5 Failure test

The system has to be resilient to many different unwanted and unexpected situations. In particular, it has to be able to recover from sudden failures of different segments of the pipeline. This test checks how the IoT Gateway behaves in critical situations. After setting up the whole system, failures of the IoT Gateway and the Kafka broker were simulated. The gateway was simply killed and brought back, while the Kafka broker was restarted using the systemctl restart kafka command. The notebook and results can be found in [47] in notebooks/Failure-Test. The IoT Node simulator was programmed to send values starting from 21.0 and increasing by 1.0 with every measurement. It can be seen that two ranges of values are missing. The first, between 26.0 and 34.0, corresponds to a short kill of the process running the gateway; the gateway was restarted right after its failure. The second, between 40.0 and 64.0, was a longer outage, allowing the IoT Node to enter a new state in which it fails to reconnect to the LwM2M server. After the gateway was brought back, the node detected its presence and immediately tried to update its registration. The registration update was refused by the gateway, so the node was forced to start a new registration process, which completed successfully, and the node again started sending measurements. The measurements taken during the gateway's outages were lost and are not recoverable. This is caused by the fact that the IoT Nodes do not buffer their measurements, a direct effect of how the OMA LwM2M protocol is designed, described in detail in section 3.5.2. During the next part of the test, the Kafka cluster was restarted. It took around 10 seconds to bring the Kafka broker back online. Nevertheless, this did not influence the final dataset: the measurements were buffered on the IoT Gateway and, once the connection was re-established, they were all successfully sent and stored in HopsFS.

5.2.6 Anomaly Detection Test

The job described in section 4.3.4 was run in Hopsworks. Next, an IoT Gateway was registered along with ten IoT Node simulators to reach the event threshold. Once the events were processed, the gateway was blocked: the domain of the gateway was added to the Kafka ACL with a deny permission, making it unable to push new measurements (see figure 5.4).

Figure 5.4: Kafka ACL after detection of too high traffic on a gateway.

The effects can also be seen on the gateway. For every new measurement sent to Kafka, it started throwing an exception: ERROR .l.iot.kafka.HopsKafkaProducer - Exception sending 1080 from Kafka: org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [topic-lwm2m-3303-temperature].

5.3 Benchmarking

After the validation process, the software was benchmarked. The goal was to see how fast the IoT Gateway delivers measurements to the cloud and how it behaves under growing traffic. The benchmark measured the time a measurement took to travel from an IoT Gateway to a Kafka broker in Hopsworks. The delivery time was calculated as the difference between the timestamp of the arrival of a measurement on the gateway and the time of its arrival on a Kafka broker. The first timestamp is taken by the gateway as described in section 3.5.2; the latter is assigned automatically by the Kafka broker. Both timestamps are available in the Kafka message.
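The delivery-time metric described above is a simple difference of the two timestamps carried with each Kafka record. A minimal sketch, with illustrative field names (the real records carry the gateway timestamp in the message payload):

```scala
// Sketch of the latency metric used in the benchmarks: the gateway
// timestamp travels with the message (section 3.5.2), the broker
// timestamp is assigned by Kafka on arrival. Field names are illustrative.
final case class TimestampedRecord(gatewayArrivalMs: Long, brokerArrivalMs: Long)

def deliveryLatencyMs(r: TimestampedRecord): Long =
  r.brokerArrivalMs - r.gatewayArrivalMs

// Average over a non-empty benchmark run.
def averageLatencyMs(rs: Seq[TimestampedRecord]): Double =
  rs.map(deliveryLatencyMs).sum.toDouble / rs.size
```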

5.3.1 Latency in a local setup

The setup for the benchmark included the Hopsworks virtual machine, the IoT Gateway, and IoT Node simulators running on the bbc2 machine in Sweden. The same data pipeline was prepared as in section 5.2.2. The analysis can be found in the notebooks in [47], and the visualization of the results in figure 5.5. For one IoT simulator, data from 1155 measurements was collected; the average latency was 12 ms. The second test included twenty IoT devices, and 6033 measurements were collected; this time, the average latency grew to 26 ms. The increase in latency was caused by the unfortunate sequence in which the nodes pushed the measurements to the gateway: one record would always be excluded from the batch of records sent to Kafka and would have to wait an additional 300 ms for the next batch (the simulators send measurements with a fixed interval of two seconds). Both tests had considerably large latencies for the first records, caused by the boot process of some services of the IoT Gateway. With the first record to send, the gateway starts the Kafka producer, which establishes connections with the Kafka broker. This is expected to take over a second, hence the delay of the very first measurements. These records were filtered out before plotting to increase visibility.

(a) One IoT simulator. (b) Twenty IoT simulators.

Figure 5.5: Measurement delivery time for local setup.

Tests with more simulators were not run with this setup because of the high resource usage. The bbc2 machine was shared with other users, and it was not advisable to increase its load further.

5.3.2 Latency in a remote setup

In most production environments, the IoT Gateways would be located far away from the servers. To better simulate such installations, the following benchmarks were performed with a remote setup. The Hopsworks virtual machine was still running on the bbc2 machine in Stockholm, Sweden, but the IoT Gateway was running on the computer machine located in Spain. Three benchmarks were performed: with one, twenty, and one hundred IoT Nodes. Data from 1181, 3240, and 15613 measurements were collected, respectively.

(a) One IoT simulator. (b) Twenty IoT simulators.

(c) Hundred IoT simulators.

Figure 5.6: Measurement delivery time for remote setup.

The results and analysis can be found in the notebooks in [47]. The results are very satisfactory. The average latency for the one-node setup was 15 ms. It grew to 94 ms for twenty nodes, and to 371 ms for one hundred nodes. It is important to note that in all cases it took a significant amount of time for the gateway to recover from the booting process: the gateway takes time to start the Kafka producer and establish all necessary connections while being flooded with measurements by the nodes, which are meanwhile buffered. Excluding the boot-up process from the calculations would significantly reduce these figures.

5.3.3 Latency results analysis

The comparison of the average latency for the local and remote setups is shown in table 5.4 and figure 5.7. All of the results are below 400 ms, which is better than expected.

Number of Nodes | local [ms] | remote [ms]
1   | 12 | 15
20  | 26 | 94
100 | -  | 371

Table 5.4: Average latency benchmark results.

Figure 5.7: Average latency benchmark result comparison.

The average would be expected to drop greatly if the gateway did not perform a cold boot for each test. The IoT Gateway is designed to be a long-running process, so the process of starting the Kafka producer would not interrupt its work as much. Nonetheless, the gateway handles growing traffic well and can efficiently deliver measurements to a Kafka broker even when running at a large physical distance from the servers.

5.3.4 Cold and warm startup

In all previous tests, the very first latency values were much higher than the later ones; after some time, the graphs stabilized at a very low latency. In this section, two tests were performed with twenty IoT Nodes. During the first test, data was collected from the nodes with a cold startup, meaning the gateway was restarted just before the devices started pushing data. The second test used a warm startup, where previous devices had already pushed data to the gateway and it had already established connections with the Kafka broker. The results of both tests are shown in figure 5.8 and [47]. It can be seen that the initial phase of the cold startup (fig. 5.8a) stands out significantly: the first records have a very high latency which decreases over time. The warm startup (fig. 5.8b) does not have this initial phase; from the very first record, the graph looks stable. Data was collected for 150 measurements with both cold and warm startup. The average latency with a cold startup was 97 ms, while with a warm startup it was only 63 ms.

(a) Measurement latency with cold startup. (b) Measurement latency with warm startup.

Figure 5.8: Measurement latency with cold and warm startup.

Chapter 6

Conclusion

This chapter summarizes the work done in this project. It reviews if the goals were achieved, presents the main areas of future work, and provides final reflections.

6.1 Goals Achieved

The project met all the set goals. It was empirically proven that the project is feasible within its scope. Support for IoT data ingestion was provided by the IoT Gateway and by extending Hopsworks. Security was ensured by the use of HTTPS and RPK; on top of that, the gateways were authenticated using JWT. In addition, the hops-util library was extended to provide tools for excluding misbehaving devices and blocking traffic from sources of DDoS attacks, and sample streaming jobs were provided to test the added functionality. Moreover, multiple tests were run to prove the reliability of the system and its ability to recover from potentially harmful situations such as a power outage or an unexpected reboot of the machines. The system demonstrated its resilience and its capability to recover after a failure of any of its elements. The IoT Gateway was also tested against heavier traffic, at the scale the test machines were able to simulate; it was shown that the gateway can deliver data quickly and reliably. The gateway generally performed very well; however, some parts, like the DatabaseService, can be optimized to make the gateway faster under heavy traffic. Lastly, examples of streaming analytics jobs were presented to visualize the measurements. The data was correctly retrieved from storage, processed, and shown in graphical form.

6.2 Future Work

The scope of the project was limited because of time constraints. To meet both the project requirements and the deadlines, some simplifications were introduced. The following elements are expected to be developed further to make the system production-ready:

• The OMA LwM2M protocol was implemented only for two types of messages: temperature and presence. It is advised to implement the rest of the IPSO objects to make the IoT Gateway fully compliant with the protocol.

• Currently, the IoT Nodes are provided with the hostname and port of the IoT Gateway. To make the system truly scalable, a bootstrap server needs to be introduced. It would maintain the list of active IoT Gateways and redirect the nodes to the optimal one; in other words, the bootstrap server would serve the role of a load balancer. This would also ease the use of hostnames instead of IP addresses, making the system much more flexible. In this case, the gateways would perform a DNS lookup.

• Extracting gateways as a separate resource not bound to a single project would greatly increase flexibility and ease the analysis of the data. Currently, the gateways are a subresource of a project, and only the stored datasets can be shared between projects.

• The Hops Kafka Authorizer currently supports access control based on the IP address. In the case of Network Address Translation (NAT), this creates a conflict between gateways: blocking one gateway could potentially block a whole range of gateways. Adding authorization based on the port as well would mitigate the problem.

• The work done in this project provides tools for the automatic exclusion of the devices and/or gateways. The next step would be to develop a real ML model that could protect the Hops platform against DDoS attacks.

• Another approach to data ingestion would be to make the IoT Nodes push the data directly to a Kafka broker. It would require a complete redesign of the system but could potentially enable end-to-end PKI security. This design would also require the deployment of Kafka brokers not only in the main data center but also in the field, introducing new challenges.

6.3 Reflections

It was shown that the IoT Gateway and the Hopsworks IoT extension work as expected. We were able to connect real IoT devices and stream their data to the cloud in a secure, performant, and reliable manner. The gateway was designed with a flexible architecture, so, by replacing the LeshanService, the system can be extended to other IoT protocols, such as MQTT. The code developed in this thesis is fully open source and free to use and distribute under the GNU v3.0 license. It was not, however, tested in a production environment; the system would need to go through an exhaustive quality assurance phase before being deployed on a real-life IoT network. We hope that the work conducted in this thesis will be the subject of further research and development in a production environment, and that the extended Hops platform will open new possibilities of data analysis to researchers, companies, and organizations.

Bibliography

[1] Ericsson. Internet of Things forecast - Ericsson Mobility Report. June 2019. URL: https://www.ericsson.com/en/mobility-report/internet-of-things-forecast (visited on 06/21/2019).

[2] Apache Software Foundation. Apache Hadoop. June 2019. URL: https://hadoop.apache.org/ (visited on 06/21/2019).

[3] Hops. Infrastructure for ML - The Data Platform for AI. June 2019. URL: https://www.hops.io/ (visited on 06/21/2019).

[4] Salman Niazi et al. "HopsFS: Scaling hierarchical file system metadata using NewSQL databases". In: 15th USENIX Conference on File and Storage Technologies (FAST 17). 2017, pp. 89–104.

[5] Apache Software Foundation. Apache Kafka - A distributed streaming platform. June 2019. URL: https://kafka.apache.org (visited on 06/21/2019).

[6] Logical Clocks AB. The Makers of Hops and Hopsworks. June 2019. URL: https://logicalclocks.com (visited on 06/21/2019).

[7] Noun Project. Cloud by Saifurrijal, modem by Kavya, Washing Machine by Manop Leklai, Surveillance Camera by Noura Mbarki, and IoT by SBTS, all from the Noun Project. June 2019. URL: https://thenounproject.com/ (visited on 06/21/2019).

[8] OMA SpecWorks. Lightweight M2M (LWM2M). June 2019. URL: https://www.omaspecworks.org/what-is-oma-specworks/iot/lightweight-m2m-lwm2m/ (visited on 06/21/2019).

[9] Eclipse Foundation. Eclipse Leshan. June 2019. URL: https://www.eclipse.org/leshan/ (visited on 06/21/2019).

[10] Silicon Labs. IoT Development Kit - Thunderboard Sense 2. June 2019. URL: https://www.silabs.com/products/development-tools/thunderboard/thunderboard-sense-two-kit (visited on 06/21/2019).

[11] Zolertia. FIREFLY - Zolertia. June 2019. URL: https://zolertia.io/product/firefly/ (visited on 06/21/2019).

[12] Contiki-NG. The OS for Next Generation IoT Devices. June 2019. URL: https://contiki-ng.org/ (visited on 06/21/2019).

[13] Contiki-NG. Tutorial: LWM2M and IPSO Objects - contiki-ng/contiki-ng Wiki. June 2019. URL: https://github.com/contiki-ng/contiki-ng/wiki/Tutorial:-LWM2M-and-IPSO-Objects (visited on 06/21/2019).

[14] École Polytechnique Fédérale de Lausanne (EPFL). The Scala Programming Language. June 2019. URL: https://www.scala-lang.org/ (visited on 06/21/2019).

[15] Wikipedia. Actor model. June 2019. URL: https://en.wikipedia.org/wiki/Actor_model (visited on 06/21/2019).

[16] H2 Database. H2 Database Engine. June 2019. URL: http://www.h2database.com/html/main.html (visited on 06/21/2019).

[17] Lightbend. Slick. June 2019. URL: http://slick.lightbend.com/ (visited on 06/21/2019).

[18] Logical Clocks AB. What is Hopsworks? June 2019. URL: https://hopsworks.readthedocs.io/en/0.9/overview/introduction/what-hopsworks.html (visited on 06/21/2019).

[19] JAX-RS. Java API for RESTful Web Services (JAX-RS) delivers API for RESTful Web Services development in Java SE and Java EE. June 2019. URL: https://github.com/jax-rs (visited on 06/21/2019).

[20] AngularJS. Superheroic JavaScript MVW Framework. June 2019. URL: https://angularjs.org/ (visited on 06/21/2019).

[21] Chef. Chef Docs. June 2019. URL: https://docs.chef.io/ (visited on 06/21/2019).

[22] Apache Software Foundation. Apache Avro. June 2019. URL: https://avro.apache.org/ (visited on 06/21/2019).

[23] Apache Software Foundation. Apache Avro 1.9 Getting Started. June 2019. URL: https://avro.apache.org/docs/current/gettingstartedjava.html#Defining+a+schema (visited on 06/21/2019).

[24] M. Jones, J. Bradley, and N. Sakimura. JSON Web Token (JWT). RFC 7519, DOI 10.17487/RFC7519. IETF, May 2015.

[25] Wikipedia. Public key infrastructure. June 2019. URL: https://en.wikipedia.org/wiki/Public_key_infrastructure (visited on 06/21/2019).

[26] Wikipedia. Embedded database. June 2019. URL: https://en.wikipedia.org/wiki/Embedded_database (visited on 06/14/2019).

[27] InfluxData. Time series database (TSDB) explained. June 2019. URL: https://www.influxdata.com/time-series-database (visited on 06/21/2019).

[28] Amazon. What Is a Key-Value Database? June 2019. URL: https://aws.amazon.com/nosql/key-value/ (visited on 06/21/2019).

[29] Oracle. A Relational Database Overview. June 2019. URL: https://docs.oracle.com/javase/tutorial/jdbc/overview/database.html (visited on 06/21/2019).

[30] Apache Software Foundation. Apache Derby. June 2019. URL: https://db.apache.org/derby/ (visited on 06/21/2019).

[31] HyperSQL. HSQLDB. June 2019. URL: http://hsqldb.org/ (visited on 06/21/2019).

[32] H2 Database. Performance. June 2019. URL: http://www.h2database.com/html/performance.html (visited on 06/21/2019).

[33] Open Mobile Alliance. Lightweight Machine to Machine Technical Specification. Tech. rep. OMA, Feb. 2018. URL: https://openmobilealliance.org/release/LightweightM2M/V1_0_2-20180209-A/OMA-TS-LightweightM2M-V1_0_2-20180209-A.pdf.

[34] Neha Narkhede, Gwen Shapira, and Todd Palino. Kafka: The Definitive Guide: Real-time Data and Stream Processing at Scale. O'Reilly Media, Inc., 2017.

[35] Logical Clocks AB. hops-util: Utility Library for Hopsworks. June 2019. URL: https://github.com/logicalclocks/hops-util (visited on 06/21/2019).

[36] P. Leach, Michael Mealling, and Rich Salz. "RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace". In: Proposed Standard, July (2005).

[37] Dan Harkins. Secure Pre-Shared Key (PSK) Authentication for the Internet Key Exchange Protocol (IKE). Tech. rep. 2012.

[38] P. Wouters et al. "RFC 7250: Using Raw Public Keys in Transport Layer Security (TLS) and Datagram Transport Layer Security (DTLS)". In: Internet Engineering Task Force (2014).

[39] David Cooper et al. "RFC 5280: Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile". In: IETF, May (2008).

[40] Lightbend. Akka: build concurrent, distributed, and resilient message-driven applications for Java and Scala. June 2019. URL: https://akka.io/ (visited on 06/21/2019).

[41] Logical Clocks AB. hopsworks-iot. June 2019. URL: https://github.com/logicalclocks/hopsworks-iot (visited on 06/21/2019).

[42] Kajetan Maliszewski. hopsworks-chef at iot-thesis. June 2019. URL: https://github.com/kai-chi/hopsworks-chef/tree/iot-thesis (visited on 06/21/2019).

[43] Kajetan Maliszewski. hopsworks at iot-thesis. June 2019. URL: https://github.com/kai-chi/hopsworks/tree/iot-thesis (visited on 06/21/2019).

[44] Apache Software Foundation. Apache HttpComponents. June 2019. URL: https://hc.apache.org/ (visited on 06/21/2019).

[45] Apache Software Foundation. Apache Spark™ - Unified Analytics Engine for Big Data. June 2019. URL: https://spark.apache.org/ (visited on 06/21/2019).

[46] Kajetan Maliszewski. hops-util at iot-thesis. June 2019. URL: https://github.com/kai-chi/hops-util/tree/iot-thesis (visited on 06/21/2019).

[47] Kajetan Maliszewski. hopsworks-iot-streaming-jobs. June 2019. URL: https://github.com/kai-chi/hopsworks-iot-streaming-jobs (visited on 06/21/2019).

[48] Apache Software Foundation. Apache Parquet. June 2019. URL: https://parquet.apache.org/ (visited on 06/21/2019).

[49] Logical Clocks AB. Hopsworks Documentation 0.9. June 2019. URL: https://hops.readthedocs.io/en/0.9/ (visited on 06/21/2019).

[50] Kajetan Maliszewski. karamel-chef. June 2019. URL: https://github.com/kai-chi/karamel-chef/tree/iot-thesis (visited on 06/21/2019).
