Design of a Distributed System for Information Processing in a Mist-Edge-Cloud Environment

Moisés Carral Ortiz

Master’s Thesis presented at the Telecommunications Engineering School for the Master’s Degree in Telecommunications Engineering

Supervisors: Felipe José Gil Castiñeira, Juan José López Escobar

2020 - 2021

Acknowledgements

I would like to express my gratitude to my tutor, Felipe Gil, who has allowed me to discover the unknown world of research and has supported me both professionally and personally to make this thesis what it is.

I would also like to acknowledge the good working environment created by my co-workers in the GTI research group.

Finally, thanks to my virtuous and tireless master’s degree teammates.

Vigo, July 20, 2021


Abstract

The exponential growth of connected devices as well as massive data generation are challenging well-established technologies such as Cloud Computing. The Big Data paradigm for inferring knowledge and predicting results supports the sensorisation of environments in all areas (social, business, environmental, etc.) to obtain more information for the training of artificial intelligence or machine learning models.

The need for low-latency and high-throughput applications in sensor networks is driving the emergence of new computational paradigms and network architectures. Real-time systems in which decision-making is critical, such as autonomous driving or the management of automated processes by robots in Industry 4.0, cannot adapt their needs to Cloud Computing. Paradigms such as Mist Computing are emerging in response to these needs, evolving current computing architectures.

This document introduces some of the proposals that, in contrast to the classical structure of centralised data processing in the Cloud, advocate for a distributed architecture spanning the IoT-Edge-Cloud Continuum. This master’s thesis is part of the collaboration of the Grupo de Tecnologías de la Información research group in the European H2020 NextPerception project. The aim of this collaboration is the creation of an intelligent distributed platform. Specifically, this thesis is focused on the design of a distributed network communication architecture, proposing, verifying and contributing to a novel technology called Zenoh. The reader will learn how the protocol works and understand the contributions we made towards improving it.

Key words: Zenoh, Data-centric, Publish/Subscribe, Fog Computing, Mist Computing, Distributed Computing


Contents

Acknowledgements iii

Abstract v

List of figures xi

1 Introduction 1
1.1 Context...... 1
1.2 NextPerception...... 2
1.3 Objectives...... 3

2 State of the art 5
2.1 Introduction...... 5
2.2 From centralised to distributed architectures...... 6
2.2.1 Centralised architectures...... 6
2.2.2 Hierarchical architectures...... 7
2.2.3 Distributed architectures...... 9
2.3 Data distribution...... 11
2.4 Embedded real-time operating systems...... 15

3 Design of a distributed computing architecture 17
3.1 Fusion strategies and computational load distribution...... 18
3.2 Architecture...... 20

4 Data distribution 27
4.1 Zenoh...... 27
4.2 Zenoh protocol...... 31
4.3 Zenoh with RTOS...... 34
4.4 Wireshark dissector...... 35
4.5 Zenoh in NextPerception...... 38

5 Results and validation 41
5.1 Metrics...... 41
5.1.1 Throughput...... 42


5.1.2 Latency...... 46
5.2 Zenoh over BLE...... 47

6 Conclusions 51
6.1 Future work...... 52

Bibliography 53

Appendices 57

Zenoh demonstrator 59
A.1 Prerequisites...... 59
A.1.1 OpenCV...... 59
A.1.2 Rust nightly...... 61
A.1.3 Zenoh-python...... 61
A.2 Deployment...... 61
A.2.1 Peer deployment...... 61
A.2.2 Mixed topology...... 65
A.3 Zenoh router installation...... 70
A.3.1 Linux (x86_64)...... 70
A.3.2 ARM (arm64)...... 70

Protocol Messages 73
B.1 Zenoh Messages...... 73
B.2 Session Messages...... 74

Acronyms

ABS Antiblockiersystem (anti-lock braking system).

AI Artificial Intelligence.

AMQP Advanced Message Queuing Protocol.

ANSI American National Standards Institute.

BLE Bluetooth Low Energy.

CDN Content Delivery Network.

CoAP Constrained Application Protocol.

CPU Central Processing Unit.

DDS Data Distribution Service.

FAT File Allocation Table.

FIM Fog Infrastructure Manager.

GATT Generic Attribute Profile.

GPIO General Purpose Input/Output.

HLC Hybrid Logical Clock.

IDEs Integrated Development Environments.

IoT Internet of Things.

IP Internet Protocol.

IT Information Technology.

KVM Kernel-based Virtual Machine.


LED Light-Emitting Diode.

LSB Least Significant Bit.

LXC LinuX Containers.

MANO Management and Network Orchestration.

ML Machine Learning.

MPUs Multiple Processing Units.

MQTT Message Queuing Telemetry Transport.

MSB Most Significant Bit.

NDN Named Data Networks.

NFV Network Functions Virtualization.

OT Operation Technology.

POSIX Portable Operating System Interface.

QoS Quality of Service.

QUIC Quick UDP Internet Connections.

RAM Random Access Memory.

REST Representational State Transfer.

RPC Remote Procedure Call.

RTOS Real-Time Operating Systems.

TCP Transmission Control Protocol.

TPU Tensor Processing Unit.

UUID Universally Unique IDentifier.

V2V Vehicle to Vehicle.

V2X Vehicle to Everything.

List of Figures

1.1 Basic architecture of the UC1 from NextPerception...... 4

2.1 Processing location selection...... 7
2.2 Fog Cloud Computing architecture [8]...... 8
2.3 Distributed computing architecture...... 9
2.4 Event based information delivery model...... 12
2.5 DDS communication scheme [14]...... 13
2.6 Protocols comparison table [17]...... 14

3.1 Complete architecture [27]...... 22
3.2 Design of NextPerception stack...... 25
3.3 FogØ5 in the UC1...... 25
3.4 Zenoh FogØ5 in the UC3...... 26

4.1 Peer, client and router configurations...... 28
4.2 Zenoh stack diagram [18]...... 29
4.3 Zenoh messages interchanged during communication...... 32
4.4 Zenoh over TCP...... 36
4.5 Zenoh over UDP...... 36
4.6 Wireshark’s Zenoh dissector without heuristics...... 37
4.7 Wireshark’s Zenoh dissector with heuristics...... 38
4.8 Diagram of demonstrator presented to the consortium...... 39
4.9 Partial solution to the partner’s problem...... 40
4.10 Final architecture of the integration with Zenoh...... 40

5.1 Zenoh throughput (Millions of messages per second)...... 43
5.2 Zenoh throughput (GB/s)...... 43
5.3 Comparison between Zenoh P2P and Zenoh brokered...... 44
5.4 Throughput in MB/s...... 45
5.5 Throughput in Mmsg/s...... 45
5.6 Zenoh-net API in P2P mode (localhost)...... 47
5.7 Zenoh API in P2P mode (localhost)...... 48
5.8 Zenoh-net API in client mode (localhost)...... 48
5.9 Zenoh API in client mode (localhost)...... 49


1 P2P deployment diagram...... 62
2 Example of a mixed Zenoh topology...... 65
3 Architecture of the example...... 66

1 Introduction

1.1 Context

Ever since its existence, the human being has sought ways to understand and solve the problems encountered along the way. Collecting and analysing information has usually helped to infer and, ultimately, to predict: to foresee problems and act even before they appear.

Different tools have been used to help in this task. Developments in computing made possible the creation of what is called Artificial Intelligence (AI) and Machine Learning (ML), used to infer and predict actions or results. Such technologies require large amounts of information, for example collected by sensors, that have to be processed in large infrastructures.

Nowadays, the most popular approach is Cloud Computing [1], which brought advantages such as virtualised resources, dynamic and scalable services, resources on demand, elasticity, etc. However, Cloud Computing usually follows a centralised approach, requiring that data be collected from Internet of Things (IoT) networks through the Internet into a centralised location. Data flowing from private networks to the Cloud (usually provided by a third-party company) could endanger the privacy of our data. Also, the huge amount of data flowing to a Cloud could collapse current data networks. Furthermore, the latency introduced by the communication with the Cloud and the processing time may be unacceptable for real-time applications.

A recent example of the vulnerability of the Cloud is the failure of Fastly, one of the most popular Content Delivery Network (CDN) and Cloud Computing service providers, which caused problems for a large number of webpages and services across the Internet [2,3].

Fortunately, in recent years other less centralised paradigms have appeared, for example hierarchical architectures (e.g., Fog Computing) and distributed architectures (Mist Computing). Depending on the objectives, it will be necessary to select one of these solutions or to combine them.

1.2 NextPerception

In this regard, the H2020 NextPerception research project is working towards the creation of new architectures for the analysis and processing of information. NextPerception is an H2020 project in which the EU has decided to create and financially support for three years (2020-2023) a consortium of companies, research centres and universities to build a solution for a series of application proposals (also called use cases) with low latency, data analysis and computation requirements. The Information Technology Group (GTI) of the AtlanTTic Research Centre of the University of Vigo is the partner in charge of the proposals in the field of communications and networking within the architecture proposed by NextPerception.

The main objective of NextPerception is to develop intelligent sensors, in combination with methodologies to support the design and management of distributed solutions, and to create demonstrators for resource-intensive applications in the health and automotive fields. In particular, demonstrations of the sensorisation and data analysis solutions are applied in psychological/motor monitoring systems as well as in vehicle and traffic management with predictive analytics applications. The proposal is to use these new intelligent networks to enable applications to make decisions with the aim of improving health and well-being as well as ensuring people’s safety.

The project distinguishes four objectives or large blocks:

• O1. State of the art and exploration of current sensors with real utility in the health, wellness and automotive fields, as well as the choice of non-intrusive devices and the definition of the physiological parameters to be measured.

• O2. Transform the information obtained by the sensor network for decision making. Provide proactive decision support with predictive analytics and explainable artificial intelligence.

• O3. Design a reference architecture to support the design, implementation and management of intelligent solutions in distributed sensor networks. Obtain a high-level technology that facilitates the creation of intelligent applications in sensor networks (e.g. middleware).

• O4. Validate and demonstrate the viability of the solutions proposed in the previous objectives in real and controlled environments in the fields of health, welfare and automotive.

The project is organised in six Work Packages (WPs): from WP1 to WP6. WP1 lays the right foundation by examining the State of the Art (SotA) and deriving requirements from the SotA and the NextPerception use cases. WP2 and WP3 are focused on the development of new contributions in the different fields of the project, including intelligent sensors, firmware, distributed intelligence, etc. These innovations will be incorporated in WP4, the package responsible for implementing the demonstrators. WP5 is responsible for dissemination and WP6 for management.

The project considers three different use cases (UC), which will be used to demonstrate the new technologies developed:

• Integral vitality monitoring (UC1). Measure and monitor health, behaviour and daily activity parameters of people in need of special attention or care in a non-intrusive way. Four possible test scenarios are envisaged: monitoring activity patterns in a home, monitoring the health status of elderly people, monitoring through wearables during physical activity and finally monitoring sleep.

• Driver monitoring (UC2). Focused on driver monitoring features in the context of partially and highly automated driving. The main objective is the monitoring of the driver and the driving environment. The demonstrators are a driving simulator, the monitoring of the driver and driving environment, and finally the application of both previous demonstrators to heavy vehicles such as trucks.

• Safety and comfort at road intersections (UC3). Related to UC2, it addresses safety at road intersections using people monitoring as well as vehicle information. UC3 has more than 14 demonstrators focused on V2V and V2X communications, collaborative detection for the simulation of 3D real spaces, etc.

The different partners, depending on their specialisation, are distributed so as to contribute and work on a specific objective, with partners who share a common objective working collaboratively. Due to the GTI’s field of study and work in communication networks, it was associated with objective O3, the subject of this master’s thesis.

In the following section we introduce the particular objectives that we want to fulfil with this work.

1.3 Objectives

The main objective of this thesis is the design of a distributed computation architecture based on the new paradigms and technologies, such as Mist Computing, that have arisen in the last few years. As mentioned before, this work is made as part of the NextPerception project.


We proposed, as part of WP3 in NextPerception, a reference architecture for fusion and cooperative inference in distributed sensor systems. These sensors also provide their own computing capacity, which can be complemented with Edge Computing, to be able to process and fuse the information captured in the environment. All these functionalities should be secure, and the latency introduced along the IoT-Edge/Fog-Cloud Continuum should be small. Thus, we propose a solution that uses the computing power of the sensing device when possible, avoiding data and computation overload at the Edge or Cloud, or distributing the computing among all the available devices (sensors, Edge and Cloud). The cornerstone of this solution is the distributed communication protocol, the centre of attention of this thesis.

This thesis starts with an exploration phase and a study of the state of the art regarding existing communication protocols and computation architectures, and continues with the validation of viable candidates, which will be tested in the UC1 use case (health monitoring). This use case, as can be seen in figure 1.1, is made up of several layers: the sensors, concentrators and gateways, which communicate via Bluetooth, and, on top of them, the Cloud and the applications.

Figure 1.1: Basic architecture of the UC1 from NextPerception.

Therefore, in order to provide an architecture for distributed processing for the health monitoring use case, it is necessary to select the adequate protocol and to make it work with the deep-embedded boards and the Bluetooth Low Energy (BLE) communication protocol. To satisfy both objectives, in chapter 2 we completed a review of the available technologies. Then, in chapter 3, we proposed a distributed computing network. To this end, we collected the requirements imposed by the use case explained above, such as the existence of vertical or horizontal communication, the type of communication protocols, resource management systems, etc. In order to choose the best alternatives, the communication elements and network orchestration mechanisms were studied in depth. Then, in chapter 4, the selected technology (a new solution that is being developed in parallel) was tested under different conditions, and finally, in chapter 5, its performance was studied through different experiments.

2 State of the art

2.1 Introduction

In order to understand the alternatives that can be used to satisfy the requirements, we will explore different popular solutions as well as new promising alternatives.

Information is in the data. Perhaps this is why there is a trend towards connecting almost any device with the aim of obtaining data or information relevant to industrial processes, new knowledge or our daily lives. This search for information and knowledge through data analysis, to automate processes or to infer and predict results, has led to what is known as the IoT. The usual approach has been to centralise the data in a location where data-analysis algorithms run and return results in the form of actions or knowledge. However, the application of these networks to real-time systems, homes, cities, vehicles and industry, among others, has driven the consolidation of Cloud Computing architectures and the creation of new paradigms such as Fog and Mist Computing, which provide the processing power to analyse the information generated in use cases such as autonomous driving or autonomous robots; support the creation of databases for AI and ML training; and support many other applications.

Therefore, in the first section of this chapter we will discuss the different computing paradigms and the path from more centralised architectures towards decentralisation. Then, we will discuss the communication needs of a distributed architecture and which communication protocols could fit them best. Finally, because of the importance of sensors in a distributed IoT-Edge-Cloud environment, we will introduce some of the applications that run on small devices.


2.2 From centralised to distributed architectures

2.2.1 Centralised architectures

Despite initial scepticism, Cloud services became increasingly popular and are today a natural choice in many application fields [4]. Consequently, all the giants of the Information Technology (IT) industry, along with many other independent providers, are increasingly investing in their Cloud infrastructure.

Conventional Cloud architecture relies on powerful computer systems and data communication networks to process, store and retrieve data from the end users. Fast and power-efficient data communication is essential for this type of Cloud. The main advantage of such a Cloud architecture is the availability of powerful computing hardware in datacenters, which makes computationally intensive operations possible.

However, this trend has been gradually changing in the last few years. Massive data generation and network overload, together with certain privacy and latency concerns, have caused a decrease in the use of the Cloud [5,6,7]. Typically, in Cloud-based AI, inference as well as learning occur in the Cloud. For example, digital assistants such as Amazon Alexa upload recorded voice to a Cloud server, where AI-based speech recognition is performed, and send the answer back to the local device. However, such centralisation of the processing is not ideal for use cases such as face verification for authentication purposes: so much rich data sent by millions of simultaneous users can overload communication channels. Moreover, for privacy reasons, end users are not eager to have their images continuously transmitted from camera to Cloud servers. Finally, the transmission of raw data to the Cloud and the return transmission of results introduces a delay, which is not desirable in all applications.

As a result of these concerns, smartphone manufacturers are already introducing local inference into their devices using powerful but energy-efficient AI engines. Due to the reduction in the number and volume of communications with a central computing unit in the Cloud, the latency to get a result is significantly reduced.

Similar trends are visible in systems for traffic surveillance, healthcare, agricultural and other applications. Due to the exceptionally large number of sensors, including many producing rich data like cameras and radars, the transmission of all captured data to a centralised Cloud server would needlessly overload communication channels and endanger the system functionality. The second important factor here is that many of these multi-sensor systems require answers in a shorter time than allowed by the sensor-Cloud-actuator path. Again, privacy issues and the need to protect valuable data (the business value is increasingly in the data) are not the strongest points of the Cloud.


Nevertheless, Edge AI devices are significantly less powerful than their Cloud counterparts. E.g., Google’s Cloud TPU has a performance of 180 teraflops (floating-point operations per second), whereas the Google Edge TPU processes 4 teraops (8-bit integer operations per second). Correspondingly, Edge AI algorithms such as neural networks need to be significantly simplified to be deployed at the Edge. In some applications, like face recognition, such simplification may not be desired; e.g., synchronizing copies of a facial database of millions of people is an even bigger nightmare than sending it in compressed form to thousands of Edge devices. An alternative therefore consists in a hybrid solution, i.e., combining Edge AI with some additional processing close to the Edge and/or in the Cloud, as shown in Figure 2.1.

Figure 2.1: Processing location selection.

2.2.2 Hierarchical architectures

In response to centralised systems and their drawbacks, such as latency and network overload, one of the first approaches was to bring computing resources physically closer to the end user or application, as shown in Figure 2.2. Thus, applications are placed in one layer or another depending on their latency requirements. In case they do not require a very low latency, they can be deployed in the upper layer, which would be the Cloud. This solution does not offer the possibilities of a distributed network, but it does simplify and provide solutions to the limitations of Cloud Computing.
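This layer-selection reasoning can be sketched as a simple policy function. This is only an illustrative sketch: the layer names, latency bounds and capacity figures below are assumptions chosen for the example, not values taken from the project.

```python
def select_layer(latency_budget_ms, required_flops):
    """Pick the lowest (closest) layer whose latency guarantee fits the
    application's budget and whose capacity covers the required compute.
    All thresholds are illustrative assumptions."""
    layers = [
        # (name, worst-case latency in ms, compute capacity in FLOPS)
        ("mist", 5, 1e9),      # on-sensor microcontrollers
        ("edge", 20, 1e12),    # gateways / edge servers
        ("fog", 100, 1e13),    # nearby fog nodes
        ("cloud", 500, 1e15),  # remote datacenter
    ]
    for name, max_latency_ms, capacity_flops in layers:
        if max_latency_ms <= latency_budget_ms and required_flops <= capacity_flops:
            return name
    return "cloud"  # fall back to the most powerful layer


print(select_layer(10, 1e8))     # latency-critical, tiny task -> "mist"
print(select_layer(50, 1e11))    # moderate task -> "edge"
print(select_layer(1000, 1e14))  # heavy task, relaxed latency -> "cloud"
```

A real orchestrator would also weigh bandwidth, energy and current node load, but the principle is the same: the latency budget decides how close to the sensors the computation must run.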

Figure 2.2 also presents a new layer: the Fog. This paradigm was proposed [9] to reduce latency and enable better connectivity in systems with a large number of sensors. The main idea behind Fog Computing is to cover the space between the Cloud and the sensors/actuators layer, by enabling multi-access and providing computational power close to the Edge.


Figure 2.2: Fog Cloud Computing architecture [8].

The Fog Computing paradigm is usually combined in a hierarchical approach with Cloud Computing, and tries to overcome the shortcomings of earlier approaches by bringing computational engines, memory and storage closer to wireless gateways, sensors and actuators. With the Fog architecture, it is possible to perform highly intensive computations using less-powerful hardware, thanks to the facilitated collaboration between computation units. The Fog architecture supports a physical and logical network hierarchy of multiple levels of cooperating nodes to support distributed applications. Compared to the classical Cloud approach, the Fog architecture offers many advantages for IoT applications.

Typically, one or more sensors, computation and communication units are enclosed in the same sensor box. Due to the limitations in power consumption, size and computational power, sensor boxes do not have enough power to process rich data, such as video or radar. At the same time, communicating rich data to a Cloud introduces a large latency, which often does not meet the application requirements. Thanks to the Fog architecture, efficient communication with the closest sensor boxes is possible, which enables more demanding computations to be carried out.

However, the Fog architecture requires careful automatic orchestration of different devices and layers, resource pooling, and handling interactions and collaborations between Fog nodes at different layers in the hierarchy. To fulfil orchestration tasks, 5G orchestration standards, such as Management and Network Orchestration (MANO), focused on Network Functions Virtualization (NFV), can be extended to orchestrate the whole Fog architecture at distinct levels, from sensors and Edge Computing servers to Cloud infrastructure.

At the lowest level, we find the sensorisation layer. The hierarchical structure establishes another new computing paradigm that is very specific to the Edge of the network: Mist Computing [10]. This new approach presents the idea of bringing the computation near the IoT edge to enhance applications in terms of latency, power, security, etc. While in Cloud and Fog the functionality is fixed, in Mist Computing there are dynamic and adjustable functionalities, with applications that can be adapted to existing running devices. It is a dynamic environment using a subscription-based information model, in which devices must be aware of and adapt to information needs and the network. Mist Computing goes a step further and moves IoT networks from simply generating data to generating information.

2.2.3 Distributed architectures

Figure 2.3: Distributed computing architecture.

Decentralised distributed computing architectures rely on multiple computing units communicating directly and exchanging data with each other. A decentralised architecture can provide low latencies and enough bandwidth to handle large amounts of rich data. Since there is no central unit, this type of architecture is quite resilient to failures, as one failure results in degraded performance rather than in the complete malfunction of the system, as would happen in a centralised system.

The main disadvantages of a distributed system compared to a centralised one are that a distributed system is, in general, 1) more difficult to set up and 2) more difficult to manage. To solve these issues, the distributed system needs some additional management functionality:


• A system to distribute software components along heterogeneous networks. The components need to be able to run on different nodes in the network (Cloud, Fog, Mist). Considering that the network nodes may have different platforms, operating systems and even CPU/accelerator architectures, the software components need to be able to cope with the platform differences and in addition be heterogeneous in nature.

• A routing system to handle the communication between the nodes. Some nodes are data providers, other nodes are data consumers. The connecting nodes may perform a part of the computations and route the remaining communication to the desired target (consumer or provider).

• A local coordination approach, for deciding which actions need to be taken when a certain event happens (e.g., failure of a node, discovery or activation of a new node, ...). For maximal fault tolerance, both the distribution and routing systems need to be decentralised. This way, a software module on a node can be launched, e.g., by a nearby node. Arbitration is needed, in order to appoint a coordinating unit which takes a leading role in case a certain action needs to be taken.
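Such arbitration can be as simple as a deterministic rule that every node applies locally, for instance "lowest identifier wins". The sketch below is a simplified illustration of the idea only, not the mechanism of any particular middleware; production systems (e.g. consensus protocols such as Raft or Paxos) must additionally handle network partitions and disagreement about which nodes are alive.

```python
def elect_coordinator(alive_nodes):
    """Deterministic arbitration: every node that observes the same set of
    alive nodes independently agrees on the same coordinator."""
    if not alive_nodes:
        raise ValueError("no nodes alive")
    return min(alive_nodes)  # lowest numeric ID takes the leading role


# Each node runs the same election locally; no central authority is needed.
nodes = {7, 3, 12}
leader = elect_coordinator(nodes)       # node 3 coordinates
nodes.discard(leader)                   # the coordinator fails...
new_leader = elect_coordinator(nodes)   # ...and the survivors agree on node 7
print(leader, new_leader)
```

Because the rule is a pure function of the observed membership, no extra messages are needed beyond keeping the membership view up to date, which fits the decentralised requirement stated above.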

The aim of using distributed computation, such as self-learning algorithms, in some applications is to reduce data flow and, thanks to the reduction in the amount of information transmitted, to prolong battery life. Self-learning, as a machine learning paradigm, employs learning with no external rewards and no external training, in contrast to other paradigms such as reinforcement learning.

In this manner, the distributed system can adapt itself to environmental changes, such as a change of network topology, failing nodes, etc. Also, the required communication bandwidth over the network increases as the number of sources of information, such as sensors or cameras, grows. The goal is then to exploit the processing capabilities of the edge devices to perform pre-processing and to exchange messages between nearby edge devices (e.g. a cooperative camera network approach, which stands for a system where information derived from multiple camera sensors is fused into a larger system in order to obtain advanced perception of the monitored environment), so that the overall communication bandwidth can be reduced.

The pre-processing at the level of sensors (wearables) is a clear example of how the distribution of computation may help to reduce data flow, while algorithms with higher computing requirements are applied at a higher hierarchical level (gateways and the Cloud), where computing power is sufficient.
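As a concrete illustration of this reduction, a wearable can summarise a window of raw samples into a single message before transmitting. The window size and payload format below are hypothetical, chosen only for the example:

```python
def summarise_window(samples):
    """Collapse a window of raw sensor readings into one small message:
    instead of transmitting every sample, send only (mean, min, max)."""
    return {
        "mean": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Six raw temperature readings become a single three-field message,
# reducing the transmitted data and hence the radio-on time of the device.
raw = [36.5, 36.6, 36.7, 36.6, 36.5, 36.4]
message = summarise_window(raw)
print(message["min"], message["max"])
```

The gateway or Cloud then runs the heavier analytics on these summaries, matching the hierarchical split described above.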

In order to be fully operational, distributed systems require some additional functionalities such as automatic system configuration, node discovery, security (e.g., using encryption), creation of hierarchical structures based on grouping of nodes (e.g., several nodes perform a cooperative sensor fusion task), etc.


As a mixture between Fog and decentralised architectures, the grouping of nodes is highly relevant for information fusion at a more local level, e.g., between neighbouring sensor boxes. For example, a coordinator may be grouped with several edge devices, so that they are seen as one node by the rest of the network. This creates a local cluster, which is especially useful for implementing sensor applications, simplifying security and offering scalability to large networks.

The use of a decentralised distributed computing architecture in monitoring applications brings several benefits. The main advantage of this architecture is that it makes it possible to reduce the amount of information transmitted between sensors and other, higher hierarchical levels, and thus extend the life of battery-powered devices. Another advantage is that each individual computing device is responsible for a partial computation. This results in a low-latency system composed of computing devices that may not have enormous computing power. Of course, there are also disadvantages. For example, it is expected that the raw measured data obtained from the sensors will be highly redundant, and the coordination between devices is not easy, requiring complex protocols. It is an approach still under development: its authors are working on extending its application to microcontrollers as well as on the design of algorithms for the provision of applications in a distributed and dynamic way, depending on the resources available on the network.

2.3 Data distribution

In this section we will study some protocols designed for the IoT, analysing their pros and cons in order to understand which could be the best option for the design of a system like the one proposed later in this document.

The collaborative fusion of information collected from different sensors requires exchanging different types of information, such as discrete measurements (temperature, acceleration, location, classification of the users, etc.), video, point clouds, radar signals, etc. Depending on the nature of the network, the sensors may appear or disappear suddenly (i.e. sensors installed in mobile elements such as cars, pedestrians, etc.). In the same way, the distributed nature of IoT computation networks makes it difficult to predict which devices generate or consume raw or processed information.

Thus, this is a challenging scenario for communications. Devices may change their location or address, and it will not be possible to know at setup time which device will perform which operation.

A possible solution is the creation of a central entity (for example located in the Cloud) responsible for receiving notifications from the different elements participating in the network. Those messages would provide information about the state of the device, its capabilities, its needs and so on. This central entity could make decisions and select the

information that should be pushed back to the devices participating in the application. Not all devices will require or will be able to process all the information, so they should indicate which information they want to receive according to their needs.

At this point, our central entity resembles an intelligent publish-subscribe broker. This paradigm is an extremely useful communication service when it is not clear in advance who needs what information [11].

A publish-subscribe architecture delivers events from sources to interested users, who declare their interest in receiving the information through a subscription procedure. This way, when a new event or piece of information is published to the system, the publish-subscribe infrastructure delivers it to the subscribers (Figure 2.4). Of primary concern are the efficient distribution of data with minimal overhead and the need to scale to hundreds or thousands of subscribers in a robust, fault-tolerant manner [12].

Figure 2.4: Event based information delivery model.

Publish-subscribe communication has some characteristics that differ from more traditional point-to-point communication, which is often less attractive in IoT networks given how the components in such networks interact. The communication is anonymous (information sources do not need to identify the destination of the information, or vice versa), asynchronous (the sender does not need to wait for an acknowledgement from the recipient) and multicast (a publisher just sends the information, and it is received by many subscribers). This architecture can cope with a dynamically changing operational environment where publishers and subscribers frequently connect and disconnect. The publish-subscribe paradigm has two principal standardised protocols: Message Queuing Telemetry Transport (MQTT) and the Data Distribution Service (DDS).
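These three properties can be illustrated with a minimal in-process sketch of a publish-subscribe dispatcher. This is a toy model written for this section, not the API of any real broker; all names (`PubSub`, the topic strings) are invented for illustration:

```python
from collections import defaultdict

class PubSub:
    """Toy publish-subscribe dispatcher: sources and sinks never
    reference each other directly (anonymous communication)."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # A subscriber declares its interest in a topic.
        self._subs[topic].append(callback)

    def publish(self, topic, payload):
        # Multicast: a single publish reaches every matching subscriber;
        # the publisher receives no acknowledgement (fire-and-forget).
        for callback in self._subs[topic]:
            callback(payload)

bus = PubSub()
received = []
bus.subscribe("sensors/temperature", received.append)
bus.subscribe("sensors/temperature", lambda v: received.append(v * 2))
bus.publish("sensors/temperature", 21)
print(received)  # → [21, 42]
```

A real broker such as MQTT adds network transport, persistence and QoS on top of this basic topic-matching logic, but the decoupling between publisher and subscribers is the same.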

The MQTT protocol follows the publish-subscribe pattern, using a centralised broker which routes the communications between sensors, devices and applications across the network. However, MQTT has disadvantages for a distributed computational network. For example:


• The broker is a single point of failure. The presence of this element does not allow full network decentralisation, as all communication goes through the broker.

• The additional hop introduced by the intermediate communication with the broker could introduce excessive latency for some real-time operations.

• It requires devices to be permanently connected to the centralised broker through a TCP/IP link, which may not be easy to maintain.

Thus, a protocol is needed that allows the nodes to exchange information directly in a fully distributed manner. An interesting alternative is DDS, which provides a fully distributed publish-subscribe protocol with real-time, scalable, data-centric capabilities [13].

Figure 2.5: DDS communication scheme [14].

In DDS the network topology is discovered dynamically and connections between nodes are established peer-to-peer, without a central server acting as a single point of failure, thus reducing message-exchange delay. MQTT defines three levels of QoS: at most once, at least once and exactly once. QoS is also one of the main features of DDS, which makes it possible to configure for each topic (a topic is the unit of information that is distributed over the network) a set of QoS policies [15] (e.g. reliability, destination order, availability, etc.).
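The idea that each topic carries its own set of QoS policies can be sketched with a small toy model. The class and field names below are simplified stand-ins invented for this section, not the actual OMG DDS API:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Reliability(Enum):
    BEST_EFFORT = "best_effort"   # analogous to MQTT's "at most once"
    RELIABLE = "reliable"         # delivery retried until acknowledged

@dataclass
class QoSProfile:
    reliability: Reliability = Reliability.BEST_EFFORT
    history_depth: int = 1                # samples kept per instance
    deadline_ms: Optional[int] = None     # max expected period between samples

@dataclass
class Topic:
    name: str
    data_type: str                        # each topic is bound to a data type
    qos: QoSProfile = field(default_factory=QoSProfile)

# High-rate sensor data tolerates losses; safety alerts do not.
lidar = Topic("vehicle/lidar", "PointCloud",
              QoSProfile(Reliability.BEST_EFFORT, history_depth=5))
alarm = Topic("intersection/alarm", "Alert",
              QoSProfile(Reliability.RELIABLE, deadline_ms=100))
print(alarm.qos.reliability.value)  # → reliable
```

The point is that the QoS contract travels with the topic, so publishers and subscribers negotiate behaviour per unit of information rather than per connection.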

Each topic is bound to a data type [13], which describes the data in a machine-readable way.


MQTT, however, is not aware of the data: applications must agree on the data structure beforehand so that they can still understand each other after future updates. In other words, DDS is data-aware whereas MQTT is data-agnostic [16].

An important characteristic for a distributed architecture is the level of abstraction provided by the protocol for the messages sent between peers, as well as for shared variable state [16]. A DDS domain allows subscribers to retrieve information without prior knowledge of the data. The self-description of the packets and the dynamic discovery without a central broker (unlike MQTT) make the protocol data-centric. MQTT, in contrast, needs to track variable state itself in the central broker.

Figure 2.6 compares some of the main characteristics of the most popular IoT protocols in use today. The Constrained Application Protocol (CoAP) has been discarded because its RESTful architecture mirrors that of current web services. Moreover, the Advanced Message Queuing Protocol (AMQP) does not support discovery between elements in the network as DDS does.

Figure 2.6: Protocols comparison table [17].

Even though DDS seems the best option among the studied protocols, there is a recently created protocol named Zenoh which deserves to be considered in this thesis. The protocol is still under heavy development; since last year Zenoh [18] has been implemented with a main objective: to integrate data at rest with data in motion [19], achieving a convergence between IT and Operational Technology (OT) systems and integrating publish/subscribe operations with the query/evaluation operations provided by the RESTful paradigm. In addition, the protocol seeks to improve on the limited scalability of DDS in terms of inter-network communication (e.g. Internet-scale operation), its performance over Wi-Fi networks, and the unification of the three layers of the Mist-Edge-Cloud Continuum.


2.4 Embedded real-time operating systems

Sensor networks, which we also call the edge of the network, have been mentioned several times in this work. This part of the network is made up of nodes that usually have a limited set of resources due to their small size and limited power capabilities. In addition, many of these nodes observe the environment or, conversely, act on it under certain conditions. Response-time constraints determine how stringent a real-time decision-making system must be, as in a vehicle's ABS braking system or a robotic arm on a conveyor belt. Early embedded systems were generally single-purpose systems running on a single processor. However, the current trend is to develop embedded systems that can execute tasks from different domains and support upgrades or new applications. This need for flexibility and multi-purpose operation has led to the current trend of using specific operating systems [20]: embedded operating systems designed for a specific task, i.e. without general-purpose components. In many cases the embedded operating system can be classified as a Real-Time Operating System (RTOS). These are mainly characterised by very well-defined, almost deterministic execution times. When a response arrives after its deadline, a hard RTOS interprets it as erroneous, whereas a soft RTOS may still accept it.
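The distinction between hard and soft real-time semantics can be captured in a few lines. This is an illustrative sketch written for this section, not code from any of the RTOSs discussed below:

```python
def evaluate_response(latency_ms, deadline_ms, hard):
    """Classify a response under hard or soft real-time semantics."""
    if latency_ms <= deadline_ms:
        return "valid"
    # Past the deadline: a hard RTOS treats the result as a failure,
    # a soft RTOS accepts it with degraded usefulness.
    return "erroneous" if hard else "degraded"

print(evaluate_response(8.0, 10.0, hard=True))    # → valid
print(evaluate_response(12.0, 10.0, hard=True))   # → erroneous
print(evaluate_response(12.0, 10.0, hard=False))  # → degraded
```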

In our scenario, there are different components running embedded operating systems or RTOSs (in sensors placed at traffic intersections and carried by pedestrians). Thus, we decided to take into account three well-known open-source RTOSs, presenting their fundamental characteristics as well as their compatibility with devices from different manufacturers, in order to understand whether the proposed distributed network can be implemented on a typical embedded device.

• Apache NuttX [21] is a real-time operating system (RTOS) with an emphasis on standards compliance and a small footprint. Scalable from 8-bit to 32-bit microcontroller environments, NuttX primarily adheres to the Portable Operating System Interface (POSIX) and American National Standards Institute (ANSI) standards.

• Zephyr OS [22] is based on a small kernel and is designed for resource-constrained embedded systems: from simple embedded environmental sensors and LED wear- able devices to complex embedded controllers, smart watches and IoT wireless applications.

• FreeRTOS [23] is released under the free GPLv2 license, specially modified to allow the use of proprietary code without releasing it. It can be easily integrated with other free peripheral and port-management libraries, such as the FatFS FAT file system handler [24] or the lwIP lightweight TCP/IP stack


[25], a USB port-handling driver, etc. It is also integrated with various Integrated Development Environments (IDEs) used for programming and debugging, such as Eclipse and MPLAB.

3 Design of a distributed computing architecture

During the last decade, Cloud processing became a preferred way of working for many users. The Cloud has numerous advantages, such as access to massive storage from any location, the possibility for multiple users to work concurrently on the same data, better reliability compared to individual hardware, etc. One of its key advantages is the possibility to use powerful computational resources, which can be upgraded without costly field maintenance. Another is the possibility to increase the robustness of solutions by adding redundancy or graceful-degradation features. There are also downsides, such as the need for communication (which degrades robustness and increases privacy concerns) and its costs. Thus, using these computation resources efficiently is not possible in all usage scenarios. For example, facial recognition can be used as an identification/authentication method, but it requires sending large amounts of video data to the Cloud. This approach is not especially effective, as it may overload communication channels, can be expensive for the user, and the communication channel may not be available when required. Because of this, many smartphone manufacturers implement this identification method locally on the smartphone. Besides alleviating communication, local authentication adequately addresses the privacy concerns of end users and significantly reduces latency.

In scenarios with a large number of sensors, such as surveillance, mobility and healthcare, transmitting all rich data to the Cloud in real time would become costly or even prohibitive. It would also be impossible to meet the much shorter latency requirements set by the control loops. Protecting company secrets or sensitive private data is usually also a large concern. On the other hand, it would be highly impractical, inefficient and costly if every sensor were connected to powerful hardware.

To overcome these and other issues, a hybrid model for distributed computation is proposed: an IoT-to-Cloud integration, the IoT-Edge/Fog-Cloud Continuum [26]. Such a paradigm offers improved real-time performance, more reliable communication, significantly reduced latency, improved power efficiency and better security. The architecture, protocols and

technologies described in this document should be powerful enough to deploy distributed algorithms onto systems composed of a mixture of cloud services, smart sensor boxes and local sensor-fusion hubs.

3.1 Fusion strategies and computational load distribution

Complex distributed systems with QoS requirements, such as surveillance systems or automated decision systems at intersections, operate at multiple functional levels. For example, leaf nodes (sensors) distributed in the intersection area form the lowest level, feeding data from car sensors into the system (extreme edge). The next level is composed of permanently powered sensors located at the same intersection. A third level of organisation could be formed by grouping multiple intersections in the same city quarter. Finally, the highest level of the hierarchy could consist of processing in the Cloud, on a non-real-time plane, to extract measures and analytics for all the intersections of a city or for user-determined zones.

To facilitate the computation and reduce the latency, which is crucial in order to provide a result before a deadline, it is a good idea to organise computations and data flows depending on the application (the level at which the data should be fused with other data, processed, etc.). In order to ensure the safety of pedestrians and other road users, it is necessary to precisely determine their locations. For this purpose, we can use information from all the levels of the system. In the intelligent-intersection use case, for example, centralised processing of all sensors using a cloud infrastructure would be expensive and would require the transmission of a large amount of information. Instead, we can rely on the Edge or Mist processing paradigms, taking advantage of the computing power at sensor locations and in vehicles to provide swift reactions to high-priority events.

The cooperative distributed processing paradigm offers an ideal solution to this situation. Intersections and vehicles can perform processing locally and autonomously, and also communicate with the other intersections and centers when possible, to improve their analytics and to build a wide scope model of the traffic situation.

In terms of data analytics and data fusion, we can consider the following types of fusion:

• Central fusion. All data is brought together at a central point. The data heterogeneity allows the richest possible fusion, creating a multi-space domain with N data variables. This approach also allows flexible upgrading, because all processing runs in the fusion centre, which could be in a cloud or edge cluster. It takes advantage of more processing power and data storage, and makes it easy to replace algorithms or deploy new functionality. However, the centralisation of information at a single point brings with it a number


of disadvantages, such as the higher bandwidth and computational capacity required. Although the bandwidth could be reduced with compression techniques, these introduce more computation and therefore increase the latency to provide a result, which is highly undesirable in real-time or ultra-low-latency applications.

• Extreme fusion. Data processing is moved to the device itself, where the information is collected locally. The main idea is that the sensors detect, analyse and simplify the messages by abstracting them (positions, identifiers, states, etc.) before sending them to a central fusion centre, which produces the final results for the target application. In this way, bandwidth consumption is reduced and no delay is added beyond the inherent delay of the analytical algorithm being executed. In addition, thanks to the N:1 reduction in messages, the networks used can be simpler. Another attraction of fusion at the Edge is scalability: adding new sensors does not increase the computational load on the fusion centre or the load on the communication network. Basically, as each sensor has its own computational resources, the computational capacity scales naturally with the sensors. However, there are also some disadvantages. The first is the limitation of the sensors in terms of hardware. Although certain types of software upgrades are possible on embedded devices, hardware upgrades are very costly and require a high level of knowledge of the element itself. Another disadvantage compared to a fully centralised fusion is the potential loss of fused information from the different sensors. Because the processing and analysis tasks are done at the Edge, errors in the decisions made by the individual sensors come into play. Two types of errors can occur: false negatives (not detecting an object of interest) and false positives (reporting an object when there is none, or reporting wrong details about an object). False negatives are particularly relevant for performance, as failure to detect an event is a loss of information.

• Cooperative fusion. As in extreme fusion, the data processing is done in the sensors themselves and abstracted messages are sent to the fusion centre. The big difference is that the smart sensors receive feedback on the final results of the analysis, as well as the results of neighbouring sensors. The feedback received from the fusion centre can be used in different ways: the results coming from the fusion-centre analysis tend to be more reliable than those produced at the Edge; furthermore, smart sensors can retrain themselves by learning from their mistakes, looking for evidence of objects that were initially not detected. Many variations on this scheme are possible: sensors can decide to use data selectively; they can be designed to always use feedback data, or only when available (resilience to communication outages). Some fusion strategies require low-latency feedback in time-critical applications; other strategies take advantage of slower and


less time-sensitive feedback (even in time-critical applications). Sensors can provide unsolicited feedback, or rather wait for a request from the fusion centre, which could selectively activate feedback depending on what it considers most useful. As in extreme fusion, it is possible to send not only detected objects but also candidate objects; however, there is less need for this, as the sensor can already make better detections thanks to the feedback, and can even send correction messages to the fusion centre.
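The bandwidth argument behind extreme fusion can be made concrete with a back-of-the-envelope calculation. Every figure below (sensor count, frame rate, frame size, detection-message size) is a hypothetical value chosen for illustration, not a measurement from any deployment:

```python
# Hypothetical scenario: 8 cameras at an intersection, 30 frames/s,
# a raw frame of ~0.5 MB versus an abstracted detection message of 64 bytes.
SENSORS = 8
FPS = 30
RAW_FRAME_BYTES = 500_000
DETECTION_MSG_BYTES = 64
AVG_DETECTIONS_PER_FRAME = 3

def central_fusion_bps():
    # Central fusion: every sensor streams raw data to the fusion centre.
    return SENSORS * FPS * RAW_FRAME_BYTES

def extreme_fusion_bps():
    # Extreme fusion: each sensor analyses locally and sends only
    # abstracted detections (positions, identifiers, states...).
    return SENSORS * FPS * AVG_DETECTIONS_PER_FRAME * DETECTION_MSG_BYTES

print(central_fusion_bps())  # → 120000000 (bytes/s)
print(extreme_fusion_bps())  # → 46080 (bytes/s)
```

Even with these rough numbers, moving the analysis to the sensors cuts the traffic towards the fusion centre by more than three orders of magnitude, at the cost of the error modes discussed above.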

3.2 Architecture

In section 2.2 we studied the main computing paradigms that exist today: centralised, hierarchical and distributed. This work presents a distributed network proposal that takes advantage of the benefits offered by the different layers of a hierarchical one: the large computational capacity of the Cloud, the proximity of resources to the Edge in Fog Computing, and the optimisation of resources at the extreme edge for ultra-low-latency applications in the Mist.

To implement an architecture with this fully distributed use of resources and applications located at the different levels of the network, it is necessary to define a set of key characteristics that distinguish this approach from other, isolated computing paradigms. Many of the characteristics mentioned below can be found as requirements of Fog and Mist Computing, but they also apply to the Cloud, seen as a set of resources remote from the Edge that can likewise be distributed together with the edge nodes.

The main requirements to be covered in a distributed network are:

• Geographical distribution. Resources are geo-distributed across different locations in the environment. In this way, unlike in Cloud Computing, it is possible to follow end devices in applications with mobility, or to demand nearby resources in applications with low-latency needs. In terms of security, the geo-distribution of resources also increases privacy, as we could choose to keep certain resources in our private network and filter beforehand what we want to send to the Cloud, for example.

• Location awareness. This is related to the large-scale distribution of Fog nodes. Although for some applications it is preferable for nodes not to store state information from other nodes in the network, it can be very useful for applications where mobility comes into play. Making devices aware of their location can help determine which resources should be accessed first (thereby reducing bandwidth consumption and obtaining low latency), and it enables applications in which the physical location of the nodes is important (e.g. tracking applications).

• Heterogeneity. Sensor networks, Fog nodes and Cloud servers are of different


architectures and nature, so a heterogeneity-ready environment is needed.

• Real-time interactions. In contrast to centralised architectures, real-time interactions are one of the principal objectives of Fog and Mist Computing. Decision-making at the extreme edge, or in the Fog, makes applications such as pedestrian and traffic control possible by reducing latency.

• Interoperability and federation. With different services coexisting on Fog nodes, and different Fog service providers, infrastructures must be able to cooperate and services must be federated across application areas. Control and signalling interfaces, as well as data interfaces, are necessary to enable interoperability between operators.

• Low resource utilisation. The network is made up of small devices with limited resources, so the control and management software, as well as the distributed communication protocols, must be as light as possible, leaving the remaining resources for the tasks and applications the devices have to perform.

• Lightweight communication protocol. As we move down the architecture from the Cloud to the Mist, the physical resources of each node become increasingly limited, so it is important to provide protocols with low overhead that can be executed by the embedded devices found in Mist networks, such as microcontrollers.

• Security. Communications between devices need to be secured, especially when they leave the private domain. Communications over public networks to service providers' platforms must be secured to prevent existing forms of cyber-attack on communications (e.g. man-in-the-middle). However, securing communications may lead to higher latencies (due to the encryption and decryption of information), and resource-poor devices may be overloaded by these tasks.

• Dynamic discovery. For distributed and published resources to be shared and accessed, they need to be dynamically discoverable. This relies on the communications protocol or middleware used (e.g. DDS or Zenoh), eliminating dependencies on server configurations and facilitating deployment.

• Hardware and I/O discovery. In Mist Computing, the discovery of available hardware and I/O (Input/Output) devices on the nodes can allow applications to be distributed across multiple nodes in case they do not have sufficient resources on a single physical resource such as GPIO.

• Large-scale deployment. With Industry 4.0 and the deployment of 5G networks and their small cells, the number of sensors and small devices is growing into the millions (see Figure 3.1).


• End-to-end unification of computing, networking and storage. Providing the capacity for the computing, storage and network management of a complete end-to-end computing architecture.

Figure 3.1 presents the three layers of the hierarchy (Mist, Fog and Cloud) and shows how the number of devices decreases as we move away from the lowest part of the network. The right-hand side of the figure shows how the latency increases as we move towards the higher layers.

Figure 3.1: Complete architecture [27].

Once the requirements for a distributed network for the IoT-Fog-Cloud Continuum have been defined, it is necessary to understand what elements and mechanisms enable these requirements to be met. For example, heterogeneity is achieved through resource virtualisation. Therefore, three pillars are distinguished that will shape the architecture proposed in NextPerception.

The first fundamental pillar to be addressed is how the nodes of the distributed network will communicate. The state of the art presented the protocols most commonly used in distributed communications. As explained, publish/subscribe messaging protocols are the best solution, since technologies such as DDS, MQTT or Zenoh provide an abstraction of the network, so that each node produces or consumes information whenever it wishes. Moreover, these protocols can be implemented on embedded devices, achieve high transmission rates (especially DDS and Zenoh), and are standardised and used in industry (MQTT and DDS). Zenoh is a promising technology still in development, but with great advantages over DDS, such as the publication of computational resources and inter-network communication.

This pillar of the architecture is where the work developed in this thesis is focused,

helping to build the proposed architecture mainly through the exploration, testing and validation of the distributed protocol.

The final decision was to use Zenoh as the distributed protocol. The first reason is that it is a novel technology, with interesting features that we will explain in the following, and that departs from the current DDS and MQTT schemes: NextPerception was born with the objective of encouraging research, creation and development, so using already-established technologies would not fulfil this European goal.

As a new and developing technology, the learning curve was quite steep, mainly because information was practically non-existent. In the absence of documentation on its low-level operation, the exploration of the technology consisted of inspecting code, testing environments, participating in the GitHub community, contributing improvements to a Wireshark dissector, and even establishing contact with ADLINK's own research group, its creators, and collaborating with them through their Gitter communication channel or by videoconference.

An interesting topic in distributed networks is how to provide nodes with an abstract view of the resources of the network, i.e. how to make nodes see resources without having to know where those resources are, only that they meet their latency, computational-capacity and storage requirements. With this approach the network can be seen as a pool of resources, but without needing to be centralised as in Cloud Computing. In this way, the second of the three pillars, resource virtualisation, can be introduced. Through this technique, a higher-level entity has the ability to deploy any application on a set of virtualised physical resources, enabling the heterogeneity of the network that was previously highlighted.

Multiple virtualisation tools exist (Citrix, QEMU, KVM, Docker, LXC, Kubernetes, etc.) and are supported by almost any conventional computer. However, such technologies can be limited by the hardware on which they run, and they do not provide an abstraction of the resources distributed in a network. Although there are real-time operating systems dedicated to microcontrollers, such as FreeRTOS, Zephyr or NuttX, and ongoing work on the virtualisation of their resources, the hardware limitations are significant.

Last but not least, it is necessary to talk about orchestration. Once we have distributed communication and the ability to use all the resources of the heterogeneous network, i.e. the abstractions needed for end-to-end unification, we need a tool to manage, monitor and orchestrate applications along the IoT-Fog-Cloud Continuum. Resources can be effectively managed where the density of devices is very high, allowing the status of the network to be obtained, topology changes to be detected and reactive changes to be applied accordingly [28].

Having defined the three fundamental pillars, we decided to focus on the two pillars where the group could provide the most expertise given its experience in these fields:

the communication protocol and orchestration. We decided to study and validate whether Zenoh and FogØ5 are suitable for these tasks.

Firstly, as introduced before, Zenoh was chosen as the distributed communication protocol because of its unification of data at rest and data in motion. This allows both persisted data (at rest) and data generated in real time to be consumed within the same heterogeneous domain, allowing devices of different architectures to coexist and collaborate with each other. The automatic discovery of domains, the scalability and the option to establish secure communications make it a valid option, in accordance with the proposed hierarchical structure, ensuring the integrity of the data when it leaves for the public domains of private Fog operators or the Cloud.

Secondly, we selected FogØ5 for the management and automated deployment of applications. It is an open-source project (created by the Zenoh developers) that aims to provide a decentralised infrastructure for provisioning and managing the available compute, storage, communication and I/O resources. It addresses highly heterogeneous systems, even those with extremely resource-constrained nodes. This software makes it possible to integrate all the components of the architecture (nodes of different layers) at the same level by virtualising heterogeneous systems.

Figure 3.2 shows the proposed communication stack for NextPerception. Zenoh takes care of distributed communication at all levels of the infrastructure. It can be seen that Zenoh covers the transport and network layers; this will be explained in more detail in the next chapter, but it is mainly due to the two levels of abstraction provided by the protocol. Above Zenoh sits FogØ5, which, from the Zenoh viewpoint, is an application. FogØ5 is also spread all over the network, as it allows the resources of the nodes at each layer to be virtualised and then managed and deployed, thus enabling the IoT-Fog-Cloud Continuum.

Finally, the NextPerception layer is the upper layer, an important layer but out of the scope of this thesis. It will provide a high-level language and middleware offering an abstraction that allows programmers to build fully distributed applications without requiring knowledge of the network, the location of the application or even the resources provided by each node.

Figure 3.3 shows how FogØ5 creates FIMs (virtual domains of nodes that are managed together) and agents (the software run by each node to build the FogØ5 entity). The diagram is associated with use case 1, presented above, and shows how this infrastructure unifies the different layers of the architecture to distribute the analysis of the information.

Another example that was presented to the NextPerception consortium proposing Zenoh and FogØ5, is shown in Figure 3.4. It shows how Zenoh would create the communications domain between all the sensors and nodes involved and then, through Fog05, how an


Figure 3.2: Design of NextPerception stack.

Figure 3.3: FogØ5 in the UC1.


application (e.g. for data analysis) would be deployed in a virtualised edge.

Figure 3.4: Zenoh and FogØ5 in the UC3.

We completed a set of tests and developed a proof of concept to evaluate FogØ5, which is still in a beta stage. We found that it is very unstable and, while still the best option for orchestration, its current state is not suitable for the project. We are in contact with the development team, who are working hard on stabilising the implementation, so we decided to postpone the integration of FogØ5 into NextPerception and focus our work on the distributed network:

• Documentation of Zenoh in order to evaluate its characteristics.

• Evaluate the capabilities of Zenoh to run over Bluetooth LE and RTOS.

• Evaluate metrics and compare them with competing protocols.

4 Data distribution

One of the main objectives of this thesis is the selection, testing and validation of a technology for the distribution of information in a distributed computing architecture. This chapter describes the completion of those steps with Zenoh, a new messaging protocol/framework under development. This was also a challenge, as many features were not yet available in the implementation, and testing them required close collaboration with the team implementing the different functionalities.

4.1 Zenoh

Zenoh ("Zero Network Overhead") is a data-centric communication protocol which blends traditional publish/subscribe protocols with geo-distributed storage, queries and computations, while retaining a level of time and space efficiency that is well beyond any of the mainstream stacks.

It has been designed to support the needs of applications that deal with data in motion, data at rest and computations in a scalable, efficient and location-transparent manner.

Zenoh provides three kinds of deployment units: peer, client and router (see Figure 4.1). In the peer working mode, the node is able to communicate with other peers in a peer-to-peer or mesh topology, or with peers outside the network domain through a router entity. The router is the most important infrastructure component of Zenoh: it is able to route data between clients and peers in any given topology. Routers are also responsible for interconnecting Zenoh domains that are not in the same subnet. Finally, the client mode is the simplest way of communicating in a Zenoh domain. A client connects to a single router or a single peer to communicate with the rest of the system, where both the router and the peer work much like a broker in MQTT.

At the architectural level, Zenoh provides two APIs:


Figure 4.1: Peer, client and router configurations.

• zenoh_net: a network-oriented API which provides an abstraction layer over the transport and network layers. It only cares about data transportation, with no interest in the data content or in storing data. It provides primitives for efficient pub/sub and distributed queries, and supports fragmentation and ordered reliable delivery. Its key primitives are:

– write: push data to the matching subscribers.

– subscribe: subscribe to specific data.

– query: query data from the matching queryables. It returns a stream of results depending on the queryables that match the sent query.

– queryable: an entity able to reply to queries; it acts as a source of values. Basically, a node declares a callback triggered whenever a query needs to be answered. A query is generally related to storage management (such as databases) or to computing.

• zenoh: a higher layer over zenoh_net that provides the same abstractions in a simpler and more data-centric manner, as well as all the building blocks to create a distributed storage system. The zenoh layer is aware of the data content and can apply content-based filtering and transcoding, as well as geo-distributed storage and distributed computed values. It offers the following primitives:

– put: the equivalent of the zenoh_net write primitive. It pushes data to the matching subscribers and storages.


– subscribe: the equivalent of the zenoh_net subscribe primitive.

– get: the equivalent of the zenoh_net query primitive. It gets data from the matching storages and evals.

– storage: the combination of a zenoh_net subscriber that listens for live data to store and a zenoh_net queryable that replies to matching get requests.

– eval: an entity able to reply to get requests, typically used to provide data on demand or to build an RPC system. It is the equivalent of a zenoh_net queryable, but backed by code implementing a specific computation.
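As a mental model of how these primitives fit together, the following self-contained Python sketch mimics put, subscribe, get, storage and eval in a single process. It is a toy illustration only: the names and semantics are simplified, and it is not the real eclipse-zenoh API.

```python
class ToyZenoh:
    """Toy in-process sketch of the zenoh-layer primitives (not the real API)."""

    def __init__(self):
        self._subscribers = {}  # key -> callbacks fired on live data
        self._storages = {}     # key -> last value seen (storage = sub + queryable)
        self._evals = {}        # key -> function producing a value on demand

    def subscribe(self, key, callback):
        self._subscribers.setdefault(key, []).append(callback)

    def declare_storage(self, key):
        # A storage subscribes to live data and later answers get() queries.
        self._storages[key] = None
        self.subscribe(key, lambda k, v: self._storages.__setitem__(k, v))

    def declare_eval(self, key, fn):
        # An eval answers get() with a freshly computed value (RPC-like).
        self._evals[key] = fn

    def put(self, key, value):
        # Push data to every matching subscriber (and hence to storages).
        for cb in self._subscribers.get(key, []):
            cb(key, value)

    def get(self, key):
        # Collect replies from matching storages and evals.
        replies = []
        if self._storages.get(key) is not None:
            replies.append(self._storages[key])
        if key in self._evals:
            replies.append(self._evals[key](key))
        return replies


bus = ToyZenoh()
bus.declare_storage("/demo/sensor/temp")
bus.declare_eval("/demo/compute/answer", lambda k: 42)
bus.put("/demo/sensor/temp", 21.5)
print(bus.get("/demo/sensor/temp"))     # [21.5]  (replied by the storage)
print(bus.get("/demo/compute/answer"))  # [42]    (computed on demand by the eval)
```

The key point the sketch captures is that a storage is nothing more than a subscriber plus a queryable, while an eval is a queryable backed by code instead of stored data.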

The two APIs and their primitives are shown in Figure 4.2. The zenoh_net API abstracts the network and transport layers from the upper layers by making available the primitives explained above. The zenoh API is built over its sibling zenoh_net, providing some further abstractions. As can be seen, the operations provided by the upper API are based on the basic primitives of the lower API.

Figure 4.2: Zenoh stack diagram [18].

We can describe Zenoh as a distributed service that defines, manages and operates over data resources through a key/value space. In order to identify resources, Zenoh uses paths as keys1. These paths represent resources as in a Unix filesystem.

Zenoh deals with resources as key/value pairs, where each key is a path, like a Unix file system path, associated with a value. Zenoh supports different encodings for the published values. By default, the protocol is able to transport and store any data as long as it is serialisable as a byte buffer. However, zenoh requires a descriptor of the data encoding for advanced features such as content filtering.

1Paths and keys are synonyms in Zenoh, so they are used interchangeably throughout this document.
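Matching on such paths can be illustrated with a small sketch. Zenoh key expressions support wildcards; here we assume `*` matches exactly one path chunk and `**` any number of chunks, a simplification of the real matching rules.

```python
def key_matches(pattern, key):
    """Toy matcher for Unix-like Zenoh paths.

    '*' matches exactly one chunk, '**' any number of chunks.
    Simplified sketch; the real key-expression rules are richer.
    """
    p = [c for c in pattern.split("/") if c]
    k = [c for c in key.split("/") if c]

    def rec(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "**":
            # '**' absorbs zero or more chunks of the key.
            return any(rec(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j == len(k):
            return False
        return (p[i] == "*" or p[i] == k[j]) and rec(i + 1, j + 1)

    return rec(0, 0)


print(key_matches("/demo/**", "/demo/sensor/temp"))      # True
print(key_matches("/demo/*/temp", "/demo/sensor/temp"))  # True
print(key_matches("/demo/*", "/demo/sensor/temp"))       # False
```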


A Zenoh feature worth highlighting, which goes beyond other pub/sub protocols, is the eval primitive, built on zenoh_net's queryable and query primitives. It allows a node to register a computation resource at a specific path.

Zenoh guarantees data ordering at any point of the system thanks to a timestamp associated with each value, avoiding the need for a consensus algorithm. Thus, any node can be sure that the value it is going to use is the newest. When a value enters the Zenoh system, the first Zenoh router that receives it generates the timestamp. The timestamp is generated using two components: the UUID of the Zenoh router that generates it and a time produced by an HLC (Hybrid Logical Clock). The HLC gives a theoretical resolution of 2^-32 seconds (about 0.23 ns), so the data order is preserved.
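The ordering argument can be made concrete with a minimal Hybrid Logical Clock sketch (this is an illustration, not the actual uhlc implementation used by Zenoh): even when the physical clock does not advance, a logical counter keeps the timestamps totally ordered, and the node UUID breaks ties between routers.

```python
import uuid


class ToyHLC:
    """Minimal Hybrid Logical Clock sketch.

    Each timestamp is (time, counter, node_id); the lexicographic order
    on these tuples gives every value a unique, comparable stamp without
    any consensus round.
    """

    def __init__(self, now_fn):
        self.node_id = uuid.uuid4().hex  # stands in for the router UUID
        self.now_fn = now_fn             # injected physical clock
        self.last_time = 0
        self.counter = 0

    def new_timestamp(self):
        physical = self.now_fn()
        if physical > self.last_time:
            self.last_time, self.counter = physical, 0
        else:
            # Physical clock did not advance: bump the logical counter.
            self.counter += 1
        return (self.last_time, self.counter, self.node_id)


# Simulated clock that is stuck: the logical part still orders the stamps.
clock = ToyHLC(now_fn=lambda: 1000)
t1 = clock.new_timestamp()
t2 = clock.new_timestamp()
print(t1 < t2)  # True: total order preserved even with a frozen clock
```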

Finally, Zenoh provides a way to add functionality to routers by loading plugins at start-up. A plugin is a library that is allowed to use the zenoh and/or zenoh_net APIs. By default, Zenoh provides a REST plugin that exposes a REST API on port 8000 for administration and data management, and a storage plugin that leverages a third-party database technology to store the key/values published and retrieved by Zenoh.

The need to secure industrial network communications and sensor networks grows with each passing day. We are at a time when cyber attacks are becoming more and more common and the integrity of our data is being compromised.

One may think that data obtained from sensors for further processing can be sent in the clear, but a lot of valuable information about the functioning of an industry can be extracted from it. Therefore, the privacy that the network communications protocol can provide is very relevant.

Firstly, the proposed architecture, with Mist and Fog Computing installed in a private facility, provides a degree of privacy that was not as well guaranteed with the use of the Cloud, where data must travel over public networks and be processed in third-party infrastructures. However, with the IoT-Fog-Cloud Continuum approach, it is necessary to secure the communications established with nodes that are not in a private infrastructure. Therefore, Zenoh provides a way to offer privacy.

Zenoh supports Quick UDP Internet Connections (QUIC) as a communication protocol. The main goal of QUIC is to provide better performance than TCP for connection-oriented communications. Basically, QUIC supports a set of multiplexed connections between two UDP hosts, offering security equivalent to the TLS/SSL protocols as well as low connection and transport latency.


4.2 Zenoh protocol

Like any communications protocol, Zenoh establishes a set of phases for discovery, connection establishment, information exchange and communication closure.

Zenoh supports TCP or UDP as the transport protocol. In this work we will focus on TCP, as it is the protocol used in the most stable version implemented by the software team so far. The default port on which Zenoh listens is 7447, although it is configurable to avoid possible conflicts with other communications. Zenoh routers always listen on this port, and it is the one they always use for communications with clients or peers. However, communications between peers are always established on randomly chosen ports, with port 7447 used exclusively for the discovery phase message (scouting).

Figures 4.3(a) and 4.3(b) present in more detail the messages exchanged by the protocol for an example publication of data.2 There is no information yet on the web or in the official documentation on how the protocol works at a low level. For this reason, the diagrams discussed below were a significant contribution to the understanding of the protocol. This information was obtained mainly through inspection of the Zenoh source code, which is written in Rust [29] (it was necessary to learn Rust, a language on the rise in the programming world), as well as through a basic Wireshark dissector implemented by a member of the community, which we discuss later.

Five distinct phases can be distinguished in the figures:

1. Scouting. Discovery phase, which consists of sending a SCOUT broadcast message to port 7447. The peer nodes or routers that hear that request reply with a HELLO containing the IP and port to use to communicate with them.

2. TCP establishment. Once two nodes have found each other, they establish a TCP session for communications.

3. Zenoh session: establishment. Once the transport layer session is established, the Zenoh session is established. For this purpose, INIT and OPEN messages are exchanged, carrying information such as node type, identifier, etc. In addition, declarations of resources (publication, subscription, queryable) as well as routing-related messages are exchanged. It should be noted that SCOUT and HELLO messages continue to be exchanged throughout the session to confirm that each node is still active.

4. Zenoh session: data distribution. Once the Zenoh session is established, the data exchange as such starts: publishing in the case of a publication, or listening in the case of a subscription.

2Note that the diagram is indicative. Due to the lack of documentation, the continuous development being done by the Zenoh developers and the lack of a supported Wireshark dissector, it is difficult to know exactly and with clarity all the messages exchanged.

(a) Zenoh messages in peer mode. (b) Zenoh messages in client mode.

Figure 4.3: Zenoh messages interchanged during communication.

5. Zenoh session: close. When a node wants to close the session and leave the Zenoh domain, it removes its resources from the domain with a DECLARE and a CLOSE.
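The five phases above can be summarised as a simple progression of states. The sketch below checks that a message trace follows the expected phase order; the message names are taken from the description above, the wire format is not modelled, and the keep-alive SCOUT/HELLO exchanges during an open session are ignored for simplicity.

```python
from enum import Enum, auto


class Phase(Enum):
    SCOUTING = auto()
    TCP_ESTABLISHMENT = auto()
    SESSION_OPEN = auto()
    DATA = auto()
    CLOSE = auto()


# Messages we expect to see in each phase (keep-alives ignored).
EXPECTED = {
    Phase.SCOUTING: {"SCOUT", "HELLO"},
    Phase.TCP_ESTABLISHMENT: {"SYN", "SYN-ACK", "ACK"},
    Phase.SESSION_OPEN: {"INIT", "OPEN", "DECLARE"},
    Phase.DATA: {"DATA"},
    Phase.CLOSE: {"DECLARE", "CLOSE"},
}


def check_trace(trace):
    """Return True if the message trace is consistent with the phase order."""
    phases = list(Phase)
    idx = 0
    for msg in trace:
        # Advance through the phases until one accepts this message.
        while idx < len(phases) and msg not in EXPECTED[phases[idx]]:
            idx += 1
        if idx == len(phases):
            return False  # message cannot appear this late in the exchange
    return True


trace = ["SCOUT", "HELLO", "SYN", "SYN-ACK", "ACK", "INIT", "OPEN",
         "DECLARE", "DATA", "DATA", "CLOSE"]
print(check_trace(trace))  # True
```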

The main difference between figures 4.3(a) and 4.3(b) is that the first example executes a SCOUT phase while the second one does not. The SCOUT phase is only executed in peer-to-peer communications for node discovery. A node operating in client mode will always have a pre-configured node to connect to (known in Zenoh as a locator), so it does not need this discovery phase.

For the routing of packets through the Zenoh domain, the nodes execute a link-state routing protocol inspired by NDN [30, 31] (Named Data Networking). In this way, a set of routing trees is created from the data forwarders to the consumers; the trees are therefore rooted at the consumers from the start of the protocol.

As introduced before, the Zenoh protocol is composed of two layers: the session protocol, which establishes a bidirectional 1:1 session between two nodes (by default the session has a best-effort channel and a reliable channel), and the routing protocol, which uses the session to propagate routing information as well as to route data from producers to consumers.

Because Zenoh uses routers to route information, the application can reliably send data to the consumers. However, this procedure has some drawbacks: in case of a node failure, the nodes that depend on it will lose some information until the network is reconfigured. Because some applications cannot afford this kind of information loss, Zenoh provides different levels of reliability that improve upon hop-to-hop reliability:

• End-to-End. A reliability channel is established between each data producer and data consumer. With this procedure the sample loss during topology changes is avoided, but it introduces some disadvantages: it is less scalable and the resource consumption increases for both producers and consumers.

• FRLR. First Router - Last Router establishes a reliability channel between the first router and the last router of each route. This approach reduces the producer/consumer resource consumption by offloading it to the nearest infrastructure components.

Furthermore, in order to avoid or react to slow network performance, Zenoh implements flow control through a set of primitives. On the one hand, applications generally know whether data must be resent or not, so consumers can control reliability by selecting a resending strategy. On the other hand, producers and Zenoh routers are able to decide how much memory they want to dedicate to reliability, allowing constrained devices to reserve resources for their applications while routers with more resources dedicate more memory to avoid congestion. Finally, if the infrastructure is suffering from congestion, producers can select message dropping. This approach, propagated to the infrastructure, allows nodes to drop messages from the reliability queue along the routing path.
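The memory/reliability trade-off can be sketched as a bounded retransmission buffer with a drop policy. This is an illustration of the idea only, not Zenoh's actual implementation.

```python
from collections import deque


class ReliabilityQueue:
    """Sketch of a bounded retransmission buffer with a drop policy.

    Mimics the idea that producers and routers dedicate a fixed amount of
    memory to reliability and, under congestion, may drop messages from
    the queue instead of blocking.
    """

    def __init__(self, capacity, drop_on_full=True):
        self.buf = deque()
        self.capacity = capacity
        self.drop_on_full = drop_on_full
        self.dropped = 0

    def push(self, msg):
        if len(self.buf) >= self.capacity:
            if self.drop_on_full:
                self.buf.popleft()  # drop the oldest unacknowledged sample
                self.dropped += 1
            else:
                return False  # apply back-pressure instead of dropping
        self.buf.append(msg)
        return True


q = ReliabilityQueue(capacity=3)
for i in range(5):
    q.push(i)
print(list(q.buf), q.dropped)  # [2, 3, 4] 2
```

A constrained device would pick a small capacity (or `drop_on_full=True`), while a well-provisioned router would choose a larger capacity to ride out congestion without losing samples.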

4.3 Zenoh with RTOS

Zenoh has APIs available for the most popular languages, including Rust, C, Python, Go and Java. This gives the protocol versatility, as it can be used on almost any system. The team behind Zenoh has created a pure C Zenoh client aimed at embedded systems, called Zenoh-pico. The company behind Zenoh, ADLINK, states the following on their GitHub: "zenoh-pico targets constrained devices and offers a C API for pure clients, i.e., it does not support peer-to-peer communication". However, it has not been integrated in any of the most widely used embedded boards for IoT (NXP, Nordic, ESP32, etc.).

Therefore, with the aim of understanding whether we could use Zenoh in the more resource-limited sensors, we decided to test the operation of Zenoh-pico. We proceeded to study the code and the libraries that Zenoh-pico uses, for its subsequent integration in some embedded boards.

The first board we decided to use was the Nordic nRF52840 DK, a single-board development kit for BLE, Bluetooth MESH, Thread, Zigbee, 802.15.4, ANT and proprietary 2.4 GHz protocols. The default Nordic SDK does not include most of the libraries used by Zenoh-pico, so we decided to use the Zephyr RTOS as the base platform to integrate the code. Zephyr, however, does not implement all the libraries that are available on bigger operating systems such as Linux; otherwise the base system would take up too much space in the memory of the devices. For example, the POSIX libraries are not fully implemented. This was a limiting factor for using Zenoh-pico on Zephyr.

The problem detected in Zenoh-pico is a design flaw. The implementation makes use of all the libraries that any Unix operating system has, such as pthread for threads, POSIX process management, the IP stack, among others. Zephyr has lighter versions of these libraries or lacks them entirely. Moreover, the original version of Zenoh, written in Rust, has the session layer abstracted, so that it can be ported to different systems as long as the session layer is configured. Zenoh-pico does not have this abstraction for the session layer, so in order to integrate Zenoh-pico into Zephyr, it is necessary to rewrite the code according to the libraries that the RTOS has, or to abstract the session layer first and then make the adaptation.

This was notified to the ADLINK team that designs and implements Zenoh through their public communications channel, and they sent back a reply to the GTI confirming what was mentioned in the previous paragraph.

Thus, after some tests, we determined that porting the complete Zenoh-pico stack to Zephyr would require an excessive effort, so we discarded this task.

It is worth noting that after testing Zephyr, we studied the possibility of integrating Zenoh-pico with Arduino. However, this was quickly discarded because Arduino does not support thread management.

Ultimately, the conclusion is that Zenoh-pico must be adapted to virtually every specific software stack developed for each board. At the time of writing, the Zenoh team has managed to integrate their solution written in pure C into the Zephyr RTOS, but they have communicated to the community that their efforts are focused on completing the full implementation of Zenoh in Rust before continuing to contribute to a solution for embedded hardware. Rust, despite being a great language for embedded systems, is still not supported on most of them, including Zephyr.

4.4 Wireshark dissector

Among the tools used for debugging networking applications, one of the most important is a traffic sniffer that gives access to packets and therefore to an understanding of what information is flowing or what is going wrong during the development phases.

Generally, the easiest way to understand a communication protocol is by reading its specification. However, sometimes such information is not available (as happens in this case), so another way to learn about a protocol is to study the packets that are exchanged over the network.

Zenoh is a recently born technology, and it is hardly documented. Understanding the implementation and operation of the protocol is a hard task: it is necessary to decode the packets manually, and it is not a protocol that uses a unique, specific port.

For all of these reasons, the Zenoh community decided to implement a dissector written in Lua for Wireshark, simplifying the debugging task by avoiding manual decoding.

As noted before, Zenoh uses port 7447 for certain communication tasks. The first available version of the Zenoh dissector only decoded messages exchanged on this port. Nevertheless, this approach does not take into account the set of cases in which the protocol makes use of other ports. For example, when Zenoh uses TCP as the transport protocol, after the scouting phase it establishes a TCP connection using random ports. Also, when a node responds to a SCOUT message with a HELLO, it uses random ports. For these reasons, many packets were not captured by the dissector.

The usage of a dissector was very important for us to understand the capabilities provided by Zenoh, and to debug the problems we were finding during our tests. Thus, we decided to improve the existing code to recognise all the Zenoh messages that were exchanged in the network.

The use of random ports by the protocol required the use of heuristics to mark the packets of potential interest for analysis. Wireshark supports the execution of heuristic dissectors that analyse the content of packets in order to determine whether a packet is of interest or not. Usually the first bytes of a packet carry metadata such as identifiers and flags, so if enough of this metadata matches, the packet is dissected.

Figure 4.4: Zenoh over TCP.

Figure 4.5: Zenoh over UDP.

The first two bytes of the TCP segments (see Figure 4.4) indicate the length of the Zenoh packet. For both UDP (see Figure 4.5) and TCP, the flags are represented by the three MSBs, while the four LSBs represent the type of message that the node receives (Zenoh or session message). Thus, the information provided by the flags, the identifiers and the transport protocol type (UDP/TCP) made it possible to create a set of heuristics. We use the following heuristics to decide whether an inspected packet belongs to Zenoh.

1. The payload length is at least 3 bytes for TCP packets.

2. The payload length is at least 2 bytes for UDP packets.

3. The Zenoh message identifier and its associated flags, contained in the first byte of UDP packets, correspond to possible values.

4. The Zenoh message identifier and its associated flags, contained in the third byte of TCP packets, correspond to possible values.
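A toy version of this heuristic in Python (the real dissector is written in Lua, and the concrete set of valid message identifiers below is a placeholder, not the actual Zenoh values):

```python
# Placeholder set: the real dissector checks the identifiers defined by
# the Zenoh protocol, which are not listed in this document.
KNOWN_MSG_IDS = {0x01, 0x02, 0x03, 0x04, 0x05}


def looks_like_zenoh(payload: bytes, transport: str) -> bool:
    """Heuristic: decide whether a TCP/UDP payload may be a Zenoh packet."""
    if transport == "udp":
        if len(payload) < 2:       # heuristic 2: minimum UDP payload length
            return False
        header = payload[0]        # heuristic 3: first byte of UDP payload
    elif transport == "tcp":
        if len(payload) < 3:       # heuristic 1: minimum TCP payload length
            return False
        header = payload[2]        # heuristic 4: byte after the 2-byte length
    else:
        return False
    # Four LSBs carry the message type; the three MSBs carry flags,
    # which this simplified check ignores.
    msg_id = header & 0x0F
    return msg_id in KNOWN_MSG_IDS


# A fake TCP segment: 2-byte length field, then a header byte of 0x02.
print(looks_like_zenoh(b"\x05\x00\x02payload", "tcp"))  # True
print(looks_like_zenoh(b"\x00", "udp"))                 # False (too short)
```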

When an analysed packet matches the heuristics, the dissector decodes and displays its content in Wireshark.

This new functionality allows Wireshark to recognise the HELLO frames in the SCOUT phase as well as the messages exchanged in a TCP session after the establishment of the Zenoh session between two nodes.

Figure 4.6 shows how the original dissector only identified Zenoh traffic addressed to port 7447. However, as shown in Figure 4.7, thanks to our new heuristics, all the Zenoh messages are identified: not only the traffic addressed to port 7447, but also the traffic using random ports. The blue colour is used for UDP packets and the pinkish tone for TCP packets.

Figure 4.6: Wireshark’s Zenoh dissector without heuristics.


Figure 4.7: Wireshark’s Zenoh dissector with heuristics.

4.5 Zenoh in NextPerception

Zenoh is a novel technology that we believe will contribute to the distributed network architecture proposed by NextPerception, so we decided to present it to the consortium of companies as an alternative to MQTT, which had been proposed by other project partners. Initially, some partners were reluctant, mainly due to their lack of knowledge of this technology, its recent birth and the lack of documentation. However, GTI, and specifically the author of this dissertation, was willing to support the rest of the partners and offer help in integrating Zenoh into their different solutions at integration time. Furthermore, Zenoh offers different features that will be essential to build applications that can be dynamically distributed in a Mist network, and that would be much harder (if not impossible) to implement with MQTT.

Firstly, with the aim of explaining in more detail the advantages offered by Zenoh in comparison to the previously mentioned protocols, we built a demonstrator that was presented to the consortium.

It consisted of a basic example application in which several image streams were published by a pair of cameras and processed using two different image analysis algorithms. In addition, a streaming server served all the streams, both original and processed. With this demonstrator, the aim was to show the capability to launch a new process for on-the-fly image processing, and the support for real-time applications such as image processing on embedded devices like the Raspberry Pi.


Figure 4.8 represents the architecture of the demonstrator. The Zenoh domain is presented in yellow and the two subnets in green and red respectively: the GTI laboratory network, which hosted the cameras and the streaming server, and the OpenStack [32] network that was used to manage our Cloud Computing infrastructure.

Figure 4.8: Diagram of demonstrator presented to the consortium.

Following the demonstration of this small example to the partners, we contacted one of the UC leaders to integrate Zenoh in one of the project demonstrators. The UC leader was implementing a demonstrator with different surveillance cameras that provided images to an artificial intelligence algorithm. This application detects different activities carried out by the recorded individuals.

We supported the UC leader during the integration of Zenoh and the development of their demonstrator. For this, we provided documentation and examples about how to use Zenoh (See Annex A) in their application.

One of the problems encountered by the partners was the addition of a new element to the demonstrator architecture. Basically, they collect the information from the cameras, send it to the Zenoh domain and process it with their own artificial intelligence algorithm. However, their cameras send an email to a specific address when an event happens: 1 when one or more persons are detected, or 0 when no one is detected. In order to prevent the publication of empty frames (when there are no people in the cameras' range) and to reduce the resource consumption of the AI processes, they wanted to add a Detection Service which parses the email and tells the camera process whether to publish or not. Thus, the solution shown in Figure 4.9 was proposed: two new publishers were added to the detection service (one flag per camera) and two subscribers (one in each camera process). This way, each camera process receives a publication with a 0 (do not publish) or a 1 (publish).
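A minimal sketch of that gating logic (the key names and the email format are illustrative assumptions, not the partner's actual implementation):

```python
# Stand-in for a Zenoh publisher: records the (key, value) pairs that
# the Detection Service would put into the Zenoh domain.
published = []


def publish(key, value):
    published.append((key, value))


def on_camera_email(camera_id: str, body: str):
    """The camera mails '1' when people are detected, '0' otherwise."""
    flag = 1 if body.strip() == "1" else 0
    # One flag publication per camera (hypothetical key layout).
    publish(f"/demo/cameras/{camera_id}/detect", flag)


# On the camera-process side, the subscriber gates frame publication.
def should_publish_frames(flag_value: int) -> bool:
    return flag_value == 1


on_camera_email("cam1", "1")
on_camera_email("cam2", "0")
print(published)
# [('/demo/cameras/cam1/detect', 1), ('/demo/cameras/cam2/detect', 0)]
```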

Once this detection service was integrated and tested in the Zenoh domain, the resulting architecture was the one presented in Figure 4.10. All the processes were deployed in Docker containers over a Zenoh domain. Once a camera detects a person, it starts to publish information and the corresponding worker starts to process the images.

Figure 4.9: Partial solution to the partner's problem.

Figure 4.10: Final architecture of the integration with Zenoh.

After the first demonstrations, the partners of the project could understand the advantages provided by Zenoh to support the distributed intelligence paradigm, which is one of the main research objectives of NextPerception.

5 Results and validation

The previous chapters described the functionalities and characteristics provided by Zenoh. Now, it is necessary to see the real capabilities that the protocol offers when putting it into operation. To do so, we will measure some of the most important metrics to take into account, as well as the possibility of using another transport technology at the link level.

5.1 Metrics

In order to obtain consistent results in the benchmarking [33] and avoid introducing noise in the performance measurements of the different protocols to be evaluated, a computer was specifically configured for these tasks with the following characteristics: Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, 16 GB of RAM and Ubuntu Linux 20.04.1 with kernel version 5.8.0-55-generic as the operating system.

The configurations carried out were as follows:

1. Disable Turbo Boost. This feature raises the CPU operating frequency under demanding tasks. In order to avoid this variation of frequency over time, it is disabled.

2. Disable Hyper-Threading. Modern physical CPU cores can run two or more simultaneous threads of execution. This means that the threads share the CPU cache, so the performance of the process under test could be affected. In order to avoid this behaviour, hyper-threading is disabled.

3. Performance Scaling Governor Policy. By default, the kernel may decide to save power. In order to avoid sub-nominal clocking, it is recommended to use the performance policy.

4. CPU affinity. In order to avoid context switches and re-scheduling between CPU cores, each process is bound to a specific CPU core during its execution.

41 Chapter 5. Results and validation

5. Process priority. In order for the process of interest to get more CPU time from the Linux scheduler, the processes under test are launched with a priority of 10 over 20.
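On a recent Linux kernel with the intel_pstate driver, these five steps can be applied roughly as follows (the sysfs paths, the chosen core and the priority value are illustrative; adapt them to the target machine and benchmark binary):

```shell
# 1. Disable Turbo Boost (intel_pstate driver)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# 2. Disable Hyper-Threading (simultaneous multithreading)
echo off | sudo tee /sys/devices/system/cpu/smt/control

# 3. Use the "performance" scaling governor on every core
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$g"
done

# 4 and 5. Pin the benchmark to one core and raise its priority
taskset -c 2 nice -n -10 ./zenoh_benchmark
```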

To evaluate Zenoh and compare it with the other known protocols mentioned before in this work, two variables were selected to evaluate performance: throughput and latency.

The data was obtained using C, C++, Rust and Bash scripts, and later processed using R for the data analysis.

5.1.1 Throughput

We measured the throughput in a distributed network where Zenoh works as the backbone communication protocol: it geo-distributes the storage thanks to its plugins for adding storage back-ends (SQL databases), allows push/sub/pull of data and shares computation through the network with queries. The number of messages per unit of time (throughput) is therefore a relevant performance variable.
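The measurement methodology can be sketched as follows. The snippet times a batch of publications against a no-op in-process sink, so the resulting number only illustrates how msg/s figures are produced; it says nothing about Zenoh's real performance.

```python
import time


def measure_throughput(publish, n_msgs=100_000, payload=b"x" * 64):
    """Time n_msgs publications and return messages per second.

    `publish` is any callable taking a payload; here we use an in-process
    sink, whereas the real benchmarks publish over the protocol under test.
    """
    start = time.perf_counter()
    for _ in range(n_msgs):
        publish(payload)
    elapsed = time.perf_counter() - start
    return n_msgs / elapsed


sink = []
rate = measure_throughput(sink.append, n_msgs=10_000)
print(f"{rate:.0f} msg/s")  # machine-dependent figure
```

The real benchmarks repeat this measurement for each payload size (8 to 16384 bytes) and report both msg/s and the derived GB/s.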

Zenoh obtained good results despite being a recently born communication protocol. Figures 5.1 and 5.2 show the difference between the two APIs provided by Zenoh (zenoh_net and zenoh).

Figure 5.1(a) shows the big difference between zenoh and zenoh_net in terms of messages per second for frames with a small payload while using the P2P (peer-to-peer) scheme. However, as the payload size increases, the throughput becomes similar between the APIs. In the brokered case (Figure 5.1(b)) the difference between the APIs is considerably smaller. Also, the difference in throughput between the P2P and client schemes using the zenoh API is small.

Figure 5.2 shows the same results in GB/s. The highest throughput is almost 5.50 GB/s, a level obtained for a payload of 8192 bytes. There is no significant difference between zenoh and zenoh_net, especially for the brokered alternative.

In order to facilitate the comparison between the performance of the peer-to-peer and brokered schemes, Figure 5.3 shows the messages per second (as millions of messages per second) and the throughput in GB/s.

After analysing the Zenoh results individually, we compared its performance with other alternatives: ZeroMQ, MQTT and DDS. MQTT and DDS are protocols regulated by a standard, with several implementations from different companies. In this case the RTI implementation was used for DDS [34] and Paho for MQTT [35].


Figure 5.1: Zenoh throughput (millions of messages per second); (a) Zenoh peer-to-peer, (b) Zenoh brokered.

Figure 5.2: Zenoh throughput (GB/s); (a) Zenoh peer-to-peer, (b) Zenoh brokered.


Figure 5.3: Comparison between Zenoh P2P and Zenoh brokered; (a) throughput in GB/s, (b) throughput in Mmsg/s.

Figures 5.4 and 5.5 show the throughput obtained in the tests run on the configured benchmarking server.

As can be seen in the graphs, Zenoh is practically unrivalled in terms of throughput. The DDS curve starts at a payload of 32 bytes because a licence is required, so a free tool provided by the vendor was used, which does not allow payloads smaller than 28 bytes. It is true that at smaller payload sizes Zenoh does not stand out from the rest of the protocols, but as the payload size increases, the difference grows in such a way that both MQTT and DDS fall out of range. However, Figure 5.5 shows that ZeroMQ generates a much higher number of messages per second than any of the other protocols up to payloads of 512 bytes.

Finally, we can see that, in terms of throughput, Zenoh stands out considerably. However, raw performance is usually not the only characteristic to take into account when selecting a protocol for a specific application.

For example, for this work Zenoh was assessed as the best choice due to a number of distinctive characteristics:

• With regard to DDS: as mentioned in the state of the art, DDS was designed and standardised with multicast network communications in mind, not as a communications protocol for networks that cannot route multicast traffic. Zenoh was born as a project that tries to overcome these DDS weaknesses. Furthermore, when DDS must also be used, Zenoh provides a plugin that acts as a gateway between the Zenoh and DDS domains.

5.1. Metrics

[Plot “Protocol comparison”: throughput (MB/s) versus payload size (8–16384 bytes) for Zenoh-net, Zenoh, MQTT, Zmq, Zenoh-net-b, Zenoh-b and DDS.]

Figure 5.4: Throughput in MB/s.

[Plot “Protocol comparison”: throughput (Mmsg/s) versus payload size (8–16384 bytes) for Zenoh-net, Zenoh, MQTT, Zmq, Zenoh-net-b, Zenoh-b and DDS.]

Figure 5.5: Throughput in Mmsg/s.


• ZeroMQ is highly regarded and is used by large companies such as Facebook and Microsoft. However, it offers a lower level of abstraction than Zenoh and DDS: it is more flexible, but the application itself has to implement and orchestrate patterns to obtain that higher level. For example, obtaining reliable pub/sub communication requires combining several ZeroMQ patterns, whereas in Zenoh or DDS it is available by default.

• MQTT uses a central element, called broker, that greatly limits the architecture proposed in this work. The broker would centralise communications, with its own pros and cons: a simplified architecture, but also a single point of failure and the need for a permanent connection.

5.1.2 Latency

As discussed throughout this document, latency is a very important parameter for distributed applications with severe time constraints. Thanks to the use of an RTOS we can obtain almost deterministic operation, but communications introduce unwanted latencies, so knowing the timing behaviour of a protocol is necessary when evaluating it for a possible application.

For the latency measurements, we used the same testing environment as in the throughput experiments. A publishing process and a subscribing process exchanged a message originated by the former and returned by the latter (hereafter referred to as a ping-pong process). One of the problems with this type of measurement on current operating systems is process context switching: if we reduce the frequency of the ping-pong between publisher and subscriber, the process will probably be descheduled by the kernel, and what we actually measure is the kernel context switch. Nevertheless, it is interesting to know what latency is achieved when publishing more slowly, because not all devices produce information at high frequencies, and in that case it is mostly the operating system that dominates the result.
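The ping-pong procedure described above can be sketched, independently of Zenoh, with two UDP sockets on the loopback interface. This is only an illustration of the measurement idea under our own naming — the actual benchmarks used the Zenoh APIs, and absolute values will differ:

```python
import socket
import statistics
import time

def pingpong_rtt(payload_size: int = 64, iterations: int = 100) -> float:
    """Median round-trip time, in microseconds, of a UDP ping-pong on localhost."""
    ping = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pong = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ping.bind(("127.0.0.1", 0))
    pong.bind(("127.0.0.1", 0))
    message = b"x" * payload_size
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        ping.sendto(message, pong.getsockname())   # "ping"
        data, _ = pong.recvfrom(65535)
        pong.sendto(data, ping.getsockname())      # "pong" echoes the payload back
        ping.recvfrom(65535)
        samples.append((time.perf_counter() - start) * 1e6)
    ping.close()
    pong.close()
    return statistics.median(samples)

print(round(pingpong_rtt(64, 100)))  # typically a few tens of µs on loopback
```

Running both ends in a single thread avoids the scheduling effects discussed above, which is precisely what a two-process measurement cannot do.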

The results shown in Figures 5.6, 5.7, 5.8 and 5.9 are organised as follows. Each coloured curve corresponds to a specific payload in the Zenoh messages. Within each graph, five 60-second runs were made for each selected interval, and from all the results obtained for each message sending frequency the median was calculated. The median was chosen over the mean mainly to avoid the influence of possible outliers appearing considerably above the mean of the distribution.
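The effect that motivates this choice can be reproduced with a toy set of RTT samples in which one run suffered a context switch (the values are illustrative, not measured data):

```python
from statistics import mean, median

# Four normal RTT samples (µs) plus one outlier caused by a context switch
rtts = [101.0, 99.0, 102.0, 100.0, 5000.0]

print(mean(rtts))    # 1080.4 -- dragged up by the single outlier
print(median(rtts))  # 101.0  -- representative of typical behaviour
```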

The first pair of graphs, in Figures 5.6 and 5.7, shows the latency of a ping-pong process in a P2P scenario on the benchmarking machine. As can be seen, the minimum latency is achieved when the payload is 1024 bytes. Furthermore, the delay difference between frequencies of 100 and 1000 messages per second is of the order of 500-600 µs. It can also be seen that the RTT stabilises from 1000 msg/s onwards. The decrease in latency as the number of messages per second increases is due to the fact that, when performing the ping-pong measurement between two processes at low frequencies, most of the time the operating system is rescheduling the processes, thus generating additional latency.

[Plot “RTT measure – Zenoh-net P2P”: median RTT (µs) versus publication rate (1 to 1e6 msg/s) for payloads of 8, 16, 32, 64, 128, 1024 and 16384 bytes.]

Figure 5.6: Zenoh-net API in P2P mode (localhost).

The difference between zenoh-net and zenoh is very small: the RTT is slightly higher for zenoh, but the gap is not as large as it was in the throughput tests.

Figures 5.8 and 5.9 display the same graphs for the scenario with Zenoh in client mode and a Zenoh router acting as a broker.

In this case the latency shoots up to values around 1200 microseconds at the lowest frequencies; from 1000 messages per second onwards it stabilises, as in the P2P scenario, but at almost triple the values. Comparing the APIs, the difference is almost non-existent in this scenario.

5.2 Zenoh over BLE

In the introduction, the different use cases proposed by NextPerception and the demonstrators associated with each one were discussed. As mentioned there, the use case in which the GTI must make contributions is UC1. The UC1 proposal has Bluetooth Low Energy as the link-level protocol for part of the sensor network.


[Plot “RTT measure – Zenoh P2P”: median RTT (µs) versus publication rate (1 to 1e6 msg/s) for payloads of 8, 16, 32, 64, 128, 1024 and 16384 bytes.]

Figure 5.7: Zenoh API in P2P mode (localhost).

[Plot “RTT measure – Zenoh-net broker”: median RTT (µs) versus publication rate (1 to 1e6 msg/s) for payloads of 8, 16, 32, 64, 128, 1024 and 16384 bytes.]

Figure 5.8: Zenoh-net API in client mode (localhost).


[Plot “RTT measure – Zenoh broker”: median RTT (µs) versus publication rate (1 to 1e6 msg/s) for payloads of 8, 16, 32, 64, 128, 1024 and 16384 bytes.]

Figure 5.9: Zenoh API in client mode (localhost).

Since we decided to use Zenoh as a novel protocol for the architecture proposed by NextPerception, differentiating it from the DDS and MQTT protocols, it was necessary to check the possibility of enabling BLE as a link protocol for Zenoh.

Firstly, and related to the previous section, the possibility of adding BLE as a transport to Zenoh-pico was studied, using some RTOS as a platform. However, as already seen, the implementation of the Zenoh client solution for embedded systems depends on Unix-specific libraries, so integrating it into non-Unix systems was not viable.

On the other hand, the possibility of adding a BLE stack to Zenoh for communication between nodes running the complete implementation written in Rust was also studied. For this purpose, a meeting was scheduled with one of the ADLINK developers working directly on the messaging protocol.

The simplest scenario that came up in the meeting was that of a Zenoh router as GATT client and Zenoh clients as GATT servers. To make things even easier, it was proposed to start from the Bluetooth LE stack provided by current Linux systems (such as Ubuntu): BlueZ, the Linux Bluetooth daemon. In this way, we could have a first approximation of a solution for a large community and avoid implementing a complete Bluetooth stack in Rust, a design and implementation task that would demand a lot of human resources and time. Therefore, code reuse through a bridge between Rust and BlueZ was sought.


As Zenoh is written entirely in Rust, the different Bluetooth LE APIs available on crates.io, the Rust community registry, were reviewed in order to reuse existing code. Of all the options available, the most comprehensive was bluez_async. Moreover, as its name suggests, it is programmed using the asynchronous scheduling system provided by Rust, which Zenoh also uses and which allows for much higher performance [36].

The bluez_async library is an asynchronous wrapper around the BlueZ D-Bus interface that exposes a subset of the Bluetooth client interfaces provided by BlueZ, focusing on the Generic Attribute Profile (GATT) and BLE. However, this crate (the name used for dependencies in Rust) only supports client interfaces, not server interfaces. In other words, if Zenoh implemented a BLE client service on top of it, it could only connect to devices that run a server service (music services, handsfree, etc.).

As this work is part of the technology-verification effort and the assessment of the best candidates, the fact that Zenoh, although chosen as a novel technology, did not support BLE for the time being was not considered a negative point.

Therefore, it was decided to wait for possible contributions and updates to the bluez_async library repository before a possible Zenoh BLE implementation. Both ADLINK and GTI saw this decision as the right one, and as a possible future collaboration after considering how to address this implementation. Coding the BlueZ D-Bus interfaces in Rust should not be a very time-consuming job for someone with Rust knowledge.

6 Conclusions

This thesis addresses scenarios that will have a major impact in future years, such as distributed computing and computing at the edge. In this chapter, we summarise the main contributions of the work as well as propose some future directions.

The work undertaken over the last few months made it possible to raise awareness of Zenoh, one of the technologies with the potential to succeed in communications, with objectives as ambitious as standardisation. This technology is especially relevant to achieving the objectives of NextPerception, an H2020 research project, with regard to the creation of a distributed intelligence layer.

The use of a centralised computation paradigm is not suitable for NextPerception. The state of the art shows the advantages brought by decentralised and distributed communications through a hierarchical architecture: it allows the implementation of applications that take advantage of all the computing power distributed in a network, that can operate even without Internet connectivity, or that use nearby devices to obtain results within a strict deadline (low latency).

Moreover, Zenoh provides interesting new features, such as support for data at rest and data in motion (which fits the dynamics of the Mist Computing paradigm in terms of data management), the use of the NDN architecture, and the possibility of sharing computing power through some of its primitives.

We have also found several points for improvement in Zenoh. The first is the documentation, which is very limited at this moment. The second is the debugging process during development, for which we enhanced the Wireshark dissector using heuristics; this dissector also helped us to understand how Zenoh works. In terms of support for embedded devices and for other link protocols such as Bluetooth LE, Zenoh is still limited. Nevertheless, we have started to study how to include such support in collaboration with ADLINK.


Finally, it is important to note that the NextPerception project is including Zenoh as one of the enabling technologies, and some other partners are starting to use Zenoh for the demonstrators.

6.1 Future work

This thesis is framed in a research project. Thus, there is still a lot of pending research and implementation work towards the final objective: the creation of a distributed intelligence architecture.

For example, one of the next steps is the creation of an orchestration layer. As noted before, FogØ5 is our main candidate for orchestrating the architecture, thanks to its capability to manage and automate the deployment of applications. Nevertheless, we will wait until its development reaches a stable version.

Another important task is the design and development of an abstraction layer that will allow developers to create distributed applications without requiring any knowledge about the underlying network and the location of the resources.

Finally, on the Zenoh side, we plan to collaborate in the implementation of Zenoh-pico for embedded systems, using an abstraction of the session layer to allow migrating this implementation in a faster and easier way than having to develop hardware-specific or RTOS-specific ports.

Bibliography

[1] Xun Xu. “From cloud computing to cloud manufacturing”. In: Robotics and Computer-Integrated Manufacturing 28.1 (2012), pp. 75–86. issn: 0736-5845. doi: 10.1016/j.rcim.2011.07.002. url: https://www.sciencedirect.com/science/article/pii/S0736584511000949.
[2] Paul Haskell-Dowland. Fastly global internet outage: why did so many sites go down — and what is a CDN, anyway? url: https://theconversation.com/fastly-global-internet-outage-why-did-so-many-sites-go-down-and-what-is-a-cdn-anyway-162371.
[3] Sam Tonkin. Amazon, Spotify, Reddit and Twitch are DOWN. url: https://www.dailymail.co.uk/sciencetech/article-9663753/Amazon-Spotify-Reddit-Twitch-DOWN.html.
[4] Michael A. Cusumano. “The cloud as an innovation platform for software development”. In: Communications of the ACM 62.10 (2019), pp. 20–22.
[5] Flavio Bonomi et al. “Fog computing and its role in the internet of things”. In: Proceedings of the first edition of the MCC workshop on Mobile cloud computing. 2012, pp. 13–16.
[6] Zhifeng Xiao and Yang Xiao. “Security and Privacy in Cloud Computing”. In: IEEE Communications Surveys & Tutorials 15.2 (2013), pp. 843–859. issn: 1553-877X. doi: 10.1109/SURV.2012.060912.00182.
[7] Minqi Zhou et al. “Security and Privacy in Cloud Computing: A Survey”. In: 2010 Sixth International Conference on Semantics, Knowledge and Grids. Nov. 2010, pp. 105–112. doi: 10.1109/SKG.2010.19.
[8] Daniel Maniglia A. da Silva et al. “An analysis of fog computing data placement algorithms”. In: Proceedings of the 16th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. 2019, pp. 527–534.
[9] Jose Santos et al. “Fog Computing: Enabling the Management and Orchestration of Smart City Applications in 5G Networks”. In: (2017). url: https://www.researchgate.net/publication/322593370_Fog_Computing_Enabling_the_Management_and_Orchestration_of_Smart_City_Applications_in_5G_Networks.


[10] Manas Kumar Yogi, K. Chandrasekhar, and G. Vijay Kumar. “Mist computing: Principles, trends and future direction”. In: arXiv preprint arXiv:1709.06927 (2017).
[11] Yongqiang Huang and Hector Garcia-Molina. “Publish/Subscribe in a Mobile Environment”. url: https://doi.org/10.1023/B:WINE.0000044025.64654.65.
[12] Gerardo Pardo-Castellote. “OMG Data-Distribution Service (DDS): Architectural Overview”. url: https://www.researchgate.net/publication/4070720_OMG_Data-Distribution_Service_architectural_overview.
[13] “Data-Distribution Service for Real-Time Systems”. url: https://www.omg.org/spec/DDS/1.4.
[14] Akram Hakiri et al. “Publish/subscribe-enabled software defined networking for efficient and scalable IoT communications”. In: IEEE Communications Magazine 53.9 (2015), pp. 48–54.
[15] RTI Core Libraries and Utilities QoS Reference Guide. url: https://community.rti.com/rti-doc/500/ndds.5.0.0/doc/pdf/RTI_CoreLibrariesAndUtilities_QoS_Reference_Guide.pdf.
[16] Georg Aures and Christian Lübben. “DDS vs. MQTT vs. VSL for IoT”. In: Network 1 (2019).
[17] DDS Foundation. How does DDS compare to other IoT Technologies? url: https://www.dds-foundation.org/features-benefits/.
[18] ADLINK. Zenoh: Key concepts. 2021. url: https://zenoh.io/docs/getting-started/key-concepts/.
[19] Angelo Corsaro. Zenoh: zero overhead pub/sub store/query compute. url: https://www.slideshare.net/Angelo.Corsaro/zenoh-zero-overhead-pubsub-storequery-compute.
[20] David Andrews et al. “Impact of embedded systems evolution on RTOS use and design”. In: Proceedings of the 1st International Workshop on Operating System Platforms for Embedded Real-Time Applications (OSPERT’05), in conjunction with the 17th Euromicro International Conference on Real-Time Systems (ECRTS’05). 2005, pp. 13–19.
[21] The Apache Software Foundation. NuttX Documentation. url: https://nuttx.apache.org/docs/latest/index.html.
[22] Zephyr Project members and individual contributors. Zephyr Project Documentation. url: https://docs.zephyrproject.org/latest/.
[23] Amazon Web Services, Inc. FreeRTOS Documentation. url: https://www.freertos.org/Documentation/RTOS_book.html.
[24] Adam Dunkels and Leon Woestenberg. FatFs - Generic FAT Filesystem Module. url: http://elm-chan.org/fsw/ff/00index_e.html.
[25] Lightweight IP Stack. url: https://www.nongnu.org/lwip/2_1_x/index.html.


[26] Luciano Baresi et al. “A unified model for the mobile-edge-cloud continuum”. In: ACM Transactions on Internet Technology (TOIT) 19.2 (2019), pp. 1–21. url: https://dl.acm.org/doi/pdf/10.1145/3226644.
[27] Ali Sunyaev. “Fog and edge computing”. In: Internet Computing. Springer, 2020, pp. 237–264.
[28] Juan José López Escobar, Rebeca P. Díaz Redondo, and Felipe Gil-Castiñeira. “Mist Computing: an in-depth analysis and open challenges”. In: (2021).
[29] Steve Klabnik and Carol Nichols. The Rust Programming Language. url: https://doc.rust-lang.org/book/.
[30] AKM Mahmudul Hoque et al. “NLSR: Named-data link state routing protocol”. In: Proceedings of the 3rd ACM SIGCOMM workshop on Information-centric networking. 2013, pp. 15–20.
[31] Alex Afanasyev et al. “A Brief Introduction to Named Data Networking”. In: MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM). 2018, pp. 1–6. doi: 10.1109/MILCOM.2018.8599682.
[32] Open Infrastructure Foundation. Open Source Cloud Computing Infrastructure. url: https://www.openstack.org/.
[33] Denis Bakhvalov. How to get consistent results when benchmarking on Linux? 2019. url: https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux.
[34] Real-Time Innovations. DDS: An Open Standard for Real-Time Applications. url: https://www.rti.com/products/dds-standard.
[35] Eclipse Foundation. Eclipse Paho MQTT Python Client. url: https://github.com/eclipse/paho.mqtt.python.
[36] ADLINK. Zenoh performance: a stroll in Rust async wonderland. 2021. url: https://zenoh.io/blog/2021-07-13-zenoh-performance-async/.


Appendices


A Zenoh demonstrator

The following document was prepared as a tutorial and proof of concept for the consortium's partners to understand how Zenoh works. In addition, the proof of concept was very similar to the solution proposed by them, which made it easier for them to integrate Zenoh into their work.

The following describes how to deploy a face and people detection application. For this, the steps to deploy the cameras as well as the image analysis algorithms are given. The same example is deployed with two configurations:

• In a local network, creating a P2P network.

• In two networks, where the cameras are in one domain and the image processing algorithms in another. Here you can see the usefulness of the Zenoh routers.

A.1 Prerequisites

• Python3 (tested with 3.6, 3.7, 3.8 and 3.9).

• pip 19.3.1 minimum (for full support of PEP 517).

• OpenCV.

• Linux x86_64 system.

• Ubuntu 20.04 (tested) / 18.04 (not tested by Uvigo).

• Rust (nightly toolchain).

• Two webcams and two computers.

A.1.1 OpenCV

$ sudo apt-get install cmake
$ sudo apt-get install gcc g++

$ sudo apt-get install python3-dev python3-numpy

# To support GUI features
$ sudo apt-get install libavcodec-dev libavformat-dev libswscale-dev
$ sudo apt-get install libgstreamer-plugins-base1.0-dev libgstreamer1.0-dev

# To support GTK 2
$ sudo apt-get install libgtk2.0-dev

# To support GTK 3
$ sudo apt-get install libgtk-3-dev

# Optional dependencies
$ sudo apt-get install libpng-dev
$ sudo apt-get install libjpeg-dev
$ sudo apt-get install libopenexr-dev
$ sudo apt-get install libtiff-dev
$ sudo apt-get install libwebp-dev

# Clone the OpenCV repo
$ sudo apt-get install git
$ git clone https://github.com/opencv/opencv.git

# Compile
$ cd opencv
$ mkdir build
$ cd build
$ cmake ../
$ make

# Install
$ sudo make install

In order to test if the installation was correct:

$ python3

> import cv2 as cv
> print(cv.__version__)


A.1.2 Rust nightly

$ sudo apt install build-essential libssl-dev pkg-config -y
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs -o /tmp/rust.sh && chmod +x /tmp/rust.sh
$ /tmp/rust.sh --default-toolchain nightly -y
$ source $HOME/.cargo/env

A.1.3 Zenoh-python

$ pip3 install eclipse-zenoh

# If the library is already installed, update it to the latest version
$ pip3 install eclipse-zenoh -U

A.2 Deployment

In the diagram above, a VPN was used to establish a tunnel with an OpenStack VIM and a Zenoh router in order to interconnect the networks. However, it is also possible to configure a peer-to-peer network within the same network domain, or to use Zenoh routers to interconnect different domains. Both the peer and the router deployments are explained below.

A.2.1 Peer deployment

The next steps are focused on the deployment of a peer-to-peer network without the use of Zenoh routers.

Clone the repository with the code:

# This repo is private. Please request authorization to [email protected]
$ git clone https://github.com/expploitt/camera_processing_zenoh.git

A.2.1.1 Publishers

The first component to deploy is the camera publishers. It is mandatory to install OpenCV, as it is used for camera capture and JPEG compression. To run the publisher in peer mode, publishing the information on the path /nextperception/camera1:


Figure 1: P2P deployment diagram.

$ cd camera_processing_zenoh
$ python3 publishers/zn_pub_camera_data.py -m peer -p /nextperception/camera1

To run the other publisher in peer mode, publishing the information on the path /nextperception/camera2:

$ cd camera_processing_zenoh
$ python3 publishers/zn_pub_camera_data.py -m peer -p /nextperception/camera2

The usage of the script is the following:

# usage: zn_pub_camera_data.py [-h] [--mode {peer,client}] [--peer LOCATOR]
#                              [--listener LOCATOR] [--path PATH] [--value VALUE]
#                              [--config FILE]
#
# zenoh-net pub example
#
# optional arguments:
#   -h, --help            show this help message and exit
#   --mode {peer,client}, -m {peer,client}
#                         The zenoh session mode.
#   --peer LOCATOR, -e LOCATOR
#                         Peer locators used to initiate the zenoh session.
#   --listener LOCATOR, -l LOCATOR
#                         Locators to listen on.
#   --path PATH, -p PATH  The name of the resource to publish.
#   --value VALUE, -v VALUE
#                         The value of the resource to publish.
#   --config FILE, -c FILE
#                         A configuration file.

A.2.1.2 Workers

The workers are the processes that carry out the image analysis computation. Each worker subscribes to the information to process (in this case, camera 1 or 2), processes the image and finally publishes the result to a path in the Zenoh domain.

The usage of the command is the following:

# usage: face_detection.py [-h] [--mode {peer,client}] [--peer LOCATOR]
#                          [--listener LOCATOR] [--selector SELECTOR] [--path PATH]
#                          [--config FILE]
#
# face_detection pub/sub example
#
# optional arguments:
#   -h, --help            show this help message and exit
#   --mode {peer,client}, -m {peer,client}
#                         The zenoh session mode.
#   --peer LOCATOR, -e LOCATOR
#                         Peer locators used to initiate the zenoh session.
#   --listener LOCATOR, -l LOCATOR
#                         Locators to listen on.
#   --selector SELECTOR, -s SELECTOR
#                         The selection of resources to subscribe.
#   --path PATH, -p PATH  The name of the resource to publish.
#   --config FILE, -c FILE
#                         A configuration file.

In our case, we are going to run the command with the -m (peer mode), -p (the path where we want to publish the processed information) and -s (the path to subscribe to) flags.

There are two workers:

• face_detection. It recognises faces and draws a blue rectangle over each one.

• person_detection. It counts the people recognised in the image.
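The subscribe-process-publish structure shared by both workers can be sketched as follows. Here stdlib queues stand in for the Zenoh session, and `detect` is a placeholder for the OpenCV analysis, so this is only an illustration of the pattern, not the actual worker code from the repository:

```python
import queue

def run_worker(input_q: queue.Queue, output_q: queue.Queue, detect) -> None:
    """Consume frames from a subscription, process them, publish the result."""
    while True:
        frame = input_q.get()        # in the real worker: the Zenoh subscriber callback
        if frame is None:            # sentinel used here to stop the sketch
            break
        output_q.put(detect(frame))  # in the real worker: publish to the Zenoh path

# Usage with a dummy "detector" that just tags each frame
frames_in, frames_out = queue.Queue(), queue.Queue()
for f in [b"frame1", b"frame2"]:
    frames_in.put(f)
frames_in.put(None)
run_worker(frames_in, frames_out, detect=lambda f: f + b"-processed")
print(frames_out.get())  # b'frame1-processed'
```

Because the worker only interacts with paths, the same pattern works whether the frames arrive over a P2P session or through a router.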


To launch face_detection with the pub/sub paths:

# Example 1: face detection on the stream of camera 1, publishing to the camera1_worker path.
$ python3 workers/face_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera1_worker

# Example 2: face detection on the stream of camera 2, publishing to the camera2_worker path.
$ python3 workers/face_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera2_worker

# Example 3: face detection on the stream of camera 1, publishing to the camera2_worker path.
$ python3 workers/face_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera2_worker

# Example 4: face detection on the stream of camera 2, publishing to the camera1_worker path.
$ python3 workers/face_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera1_worker

To launch person_detection with the pub/sub paths:

# Example 1: person detection on the stream of camera 1, publishing to the camera1_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera1_worker

# Example 2: person detection on the stream of camera 2, publishing to the camera2_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera2_worker

# Example 3: person detection on the stream of camera 1, publishing to the camera2_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera2_worker

# Example 4: person detection on the stream of camera 2, publishing to the camera1_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera1_worker


A.2.1.3 Streaming server

The streaming server is a basic Flask web server that allows the user to see the four publishing paths mentioned before. The subscription paths are hard-coded, so it is necessary to edit the Python script to change them. To execute the script:

$ python3 streaming_server/streaming_server.py -m peer

# Open localhost:2204/ in your browser
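One way to remove the hard-coded subscription paths would be to accept them on the command line, as the other scripts do. The sketch below is our own suggestion, not code from the repository, and the option names are hypothetical:

```python
import argparse

def parse_stream_paths(argv):
    """Parse the streaming-server options; fall back to the four demo paths."""
    parser = argparse.ArgumentParser(description="streaming server (sketch)")
    parser.add_argument("--mode", "-m", choices=["peer", "client"], default="peer")
    parser.add_argument("--subscribe", "-s", action="append", default=None,
                        help="path to display; may be repeated")
    args = parser.parse_args(argv)
    if args.subscribe is None:
        args.subscribe = [
            "/nextperception/camera1", "/nextperception/camera2",
            "/nextperception/camera1_worker", "/nextperception/camera2_worker",
        ]
    return args

args = parse_stream_paths(["-m", "peer", "-s", "/nextperception/camera1"])
print(args.subscribe)  # ['/nextperception/camera1']
```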

A.2.2 Mixed topology

Figure 2: Example of a mixed Zenoh topology.

A.2.2.1 Zenoh router

This element is necessary when we want to connect the devices as clients or to interconnect different network domains.

To install the Zenoh router:

$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenoh-storages_0.5.0~beta.8_amd64.deb
$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenoh-rest_0.5.0~beta.8_amd64.deb
$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenohd_0.5.0~beta.8_amd64.deb

$ sudo dpkg -i zenohd_0.5.0~beta.8_amd64.deb
$ sudo dpkg -i zenoh-storages_0.5.0~beta.8_amd64.deb
$ sudo dpkg -i zenoh-rest_0.5.0~beta.8_amd64.deb

# To run the zenoh router. When interconnecting zenoh routers, it is necessary to use
# the -e flag in each zenoh router instance, pointing to each neighbour.
$ RUST_LOG=debug zenohd [-e tcp/router_peer_ip:7447]

We are going to assume the following scenario:

Figure 3: Architecture of the example.

First, we are going to interconnect the routers:

# Run this command in zenoh router 1
$ RUST_LOG=debug zenohd -e tcp/10.0.1.3:7447

# Run this command in zenoh router 2
$ RUST_LOG=debug zenohd -e tcp/10.0.1.2:7447

Clone the repository with the code:

# This repo is private. Please request authorization to [email protected]
$ git clone https://github.com/expploitt/camera_processing_zenoh.git

A.2.2.2 Publishers

The first component to deploy is the camera publishers. It is mandatory to install OpenCV, as it is used for camera capture and JPEG compression.


To run the publisher in peer mode, publishing the information on the path /nextperception/camera1:

$ cd camera_processing_zenoh
$ python3 publishers/zn_pub_camera_data.py -m peer -p /nextperception/camera1 -e tcp/10.0.1.2:7447

To run the other publisher in peer mode, publishing the information on the path /nextperception/camera2:

$ cd camera_processing_zenoh
$ python3 publishers/zn_pub_camera_data.py -m peer -p /nextperception/camera2 -e tcp/10.0.1.2:7447

The usage of the script is the following:

# usage: zn_pub_camera_data.py [-h] [--mode {peer,client}] [--peer LOCATOR]
#                              [--listener LOCATOR] [--path PATH] [--value VALUE]
#                              [--config FILE]
#
# zenoh-net pub example
#
# optional arguments:
#   -h, --help            show this help message and exit
#   --mode {peer,client}, -m {peer,client}
#                         The zenoh session mode.
#   --peer LOCATOR, -e LOCATOR
#                         Peer locators used to initiate the zenoh session.
#   --listener LOCATOR, -l LOCATOR
#                         Locators to listen on.
#   --path PATH, -p PATH  The name of the resource to publish.
#   --value VALUE, -v VALUE
#                         The value of the resource to publish.
#   --config FILE, -c FILE
#                         A configuration file.

A.2.2.3 Workers

The workers are the processes that carry out the image analysis computation. Each worker subscribes to the information to process (in this case, camera 1 or 2), processes the image and finally publishes the result to a path in the Zenoh domain.

The usage of the command is the following:

# usage: face_detection.py [-h] [--mode {peer,client}] [--peer LOCATOR]
#                          [--listener LOCATOR] [--selector SELECTOR] [--path PATH]
#                          [--config FILE]
#
# face_detection pub/sub example
#
# optional arguments:
#   -h, --help            show this help message and exit
#   --mode {peer,client}, -m {peer,client}
#                         The zenoh session mode.
#   --peer LOCATOR, -e LOCATOR
#                         Peer locators used to initiate the zenoh session.
#   --listener LOCATOR, -l LOCATOR
#                         Locators to listen on.
#   --selector SELECTOR, -s SELECTOR
#                         The selection of resources to subscribe.
#   --path PATH, -p PATH  The name of the resource to publish.
#   --config FILE, -c FILE
#                         A configuration file.

In our case, we are going to run the command with the -m (client mode), -p (the path where we want to publish the processed information), -s (the path to subscribe to) and -e (the peer locator, so that the process knows which routers are its peers) flags. There are two workers:

• face_detection. It recognises faces and draws a blue rectangle over each of them.

• person_detection. It counts the people recognised in the image.
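Conceptually, both workers implement the same subscribe, process, publish loop. The sketch below illustrates that data flow only: the zenoh calls (`zenoh.open`, `declare_subscriber`, `Session.put`) follow the current zenoh-python API rather than the 0.5.0-beta zenoh-net API used in this demonstrator, and `process_frame` is a hypothetical placeholder for the OpenCV detection step.

```python
def process_frame(frame: bytes) -> bytes:
    """Placeholder for the image analysis step (face/person detection).

    A real worker would decode the JPEG, run the OpenCV detector and
    re-encode the annotated image; here the bytes pass through unchanged.
    """
    return frame

def run_worker(selector: str, out_path: str) -> None:
    # zenoh is imported lazily so process_frame above stays testable
    # without zenoh installed. API names may differ across versions.
    import time
    import zenoh

    session = zenoh.open(zenoh.Config())

    def on_sample(sample):
        # sample.payload carries the raw frame; its exact type varies
        # between zenoh-python versions.
        session.put(out_path, process_frame(bytes(sample.payload)))

    sub = session.declare_subscriber(selector, on_sample)  # keep the reference alive
    while True:
        time.sleep(1)
```

A worker would then be started as, e.g., `run_worker("/nextperception/camera1", "/nextperception/camera1_worker")`.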

To launch face_detection for the different pub/sub path combinations:

# Example 1: Face detection on the stream of camera 1, publishing to the camera1_worker path.
$ python3 workers/face_detection.py -m client -s /nextperception/camera1 -p /nextperception/camera1_worker -e tcp/10.0.1.3:7447

# Example 2: Face detection on the stream of camera 2, publishing to the camera2_worker path.
$ python3 workers/face_detection.py -m client -s /nextperception/camera2 -p /nextperception/camera2_worker -e tcp/10.0.1.3:7447

# Example 3: Face detection on the stream of camera 1, publishing to the camera2_worker path.
$ python3 workers/face_detection.py -m client -s /nextperception/camera1 -p /nextperception/camera2_worker -e tcp/10.0.1.3:7447

# Example 4: Face detection on the stream of camera 2, publishing to the camera1_worker path.
$ python3 workers/face_detection.py -m client -s /nextperception/camera2 -p /nextperception/camera1_worker -e tcp/10.0.1.3:7447

To launch person_detection for the different pub/sub path combinations:

# Example 1: Person detection on the stream of camera 1, publishing to the camera1_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera1_worker -e tcp/10.0.1.3:7447

# Example 2: Person detection on the stream of camera 2, publishing to the camera2_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera2_worker -e tcp/10.0.1.3:7447

# Example 3: Person detection on the stream of camera 1, publishing to the camera2_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera1 -p /nextperception/camera2_worker -e tcp/10.0.1.3:7447

# Example 4: Person detection on the stream of camera 2, publishing to the camera1_worker path.
$ python3 workers/person_detection.py -m peer -s /nextperception/camera2 -p /nextperception/camera1_worker -e tcp/10.0.1.3:7447

A.2.2.4 Streaming server

The streaming server is a basic Flask web server that lets the user see the four publishing paths mentioned before. The subscription paths are hard-coded, so changing them requires editing the Python script.
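As a possible improvement, the hard-coded subscription paths could be lifted into a repeatable command-line flag, sketched below with only the standard library. The `--selector` flag name and the default path list are assumptions mirroring the worker scripts, not the actual streaming_server.py code.

```python
import argparse

# Default subscription paths, mirroring the four paths used in the demo.
DEFAULT_PATHS = [
    "/nextperception/camera1",
    "/nextperception/camera2",
    "/nextperception/camera1_worker",
    "/nextperception/camera2_worker",
]

parser = argparse.ArgumentParser(description="streaming server (sketch)")
parser.add_argument("--selector", "-s", action="append",
                    help="Path to subscribe to; repeat the flag for several paths.")

args = parser.parse_args([])          # no flags given -> fall back to the defaults
paths = args.selector or DEFAULT_PATHS
print(paths)
```

With this in place, `python3 streaming_server.py -s /nextperception/camera1 -s /nextperception/camera1_worker` would override the defaults without editing the script.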

To execute the script:


$ python3 streaming_server/streaming_server.py -m client -e tcp/10.0.1.2:7447

# Open localhost:2204/ in your browser

A.3 Zenoh router installation

This section shows the steps to configure the Zenoh router software on ARM and x86_64 architectures.

An example of a configuration of two routers is shown in the Mixed topology > Zenoh Router chapter.

A.3.1 Linux (x86_64)

The x86_64 version is built and uploaded with every release, so it is not necessary to compile the code on our machine.

To install the Zenoh router:

$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenoh-storages_0.5.0~beta.8_amd64.deb
$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenoh-rest_0.5.0~beta.8_amd64.deb
$ wget https://download.eclipse.org/zenoh/zenoh/latest/zenohd_0.5.0~beta.8_amd64.deb

$ sudo dpkg -i zenohd_0.5.0~beta.8_amd64.deb

# To run the zenoh router. When interconnecting zenoh routers, it is necessary to use
# the -e flag in each zenoh router instance, pointing to each neighbour.
$ RUST_LOG=debug zenohd [-e tcp/router_peer_ip:7447]

A.3.2 ARM (arm64)

The ARM version is not uploaded to the Zenoh repositories, so it is necessary to compile it and also to create the .deb package.

To avoid configuration issues, the compiled code is available in the camera_processing_zenoh GitHub repository. To install the Zenoh router:

$ cd camera_processing_zenoh
$ cd deb_packages
$ sudo dpkg -i zenohd_0.5.0-dev_arm64.deb

# To run the zenoh router. When interconnecting zenoh routers, it is necessary to use
# the -e flag in each zenoh router instance, pointing to each neighbour.
$ RUST_LOG=debug zenohd [-e tcp/router_peer_ip:7447]


Appendix B. Protocol Messages

This annex describes the types of messages in Zenoh. Two types of messages are distinguished: Zenoh and Session messages.

B.1 Zenoh Messages

Zenoh messages are exchanged once the protocol session between two nodes has been established. They relate to the declaration of resources (publication, subscription, query, etc.), the sending of data, queries, requests for certain data, or network link status.

Message          Type           Id    Reliability   Congestion Control
DECLARE          Zenoh Message  0x0b  Reliable      Block
DATA             Zenoh Message  0x0c  Best-effort   Drop
QUERY            Zenoh Message  0x0d  Reliable      Block
PULL             Zenoh Message  0x0e  Reliable      Block
UNIT             Zenoh Message  0x0f  Best-effort   Block
LINK_STATE_LIST  Zenoh Message  0x10  Reliable      Block

Table 1: Zenoh messages.

Each message carries a set of three flag bits, which provide additional information to the receiver. The flags correspond to the three most significant bits of the first byte of the message in the case of UDP, or of the third byte for TCP (the first two bytes are used to indicate the length of the message).

Flag  Flag Type  Value  Name         Meaning
D     Zenoh      0x20   Dropping     The message can be dropped
F     Zenoh      0x20   Final        It is the final message (e.g. ReplyContext, Pull)
I     Zenoh      0x40   DataInfo     DataInfo is present
K     Zenoh      0x80   ResourceKey  Only the numerical ID is present
N     Zenoh      0x40   MaxSamples   MaxSamples is indicated
P     Zenoh      0x01   Pid          The pid is present
R     Zenoh      0x20   Reliable     It concerns the reliable channel, best-effort otherwise
S     Zenoh      0x40   SubMode      The declaration SubMode is indicated
T     Zenoh      0x20   QueryTarget  The query target is present

Table 2: Available flags for Zenoh messages.

There is another kind of message, the message decorator, that can be attached to a Zenoh message in order to add information to it. We distinguish:

• The Attachment can decorate any message (i.e. SessionMessage and ZenohMessage) and allows any additional information to be appended to it. Since the information contained in the attachment is relevant only to the layer that provided it (e.g. Session, Zenoh, User), it is the duty of that layer to serialize and de-serialize the attachment whenever deemed necessary.

• The ReplyContext is a message decorator for either the data messages that result from a query, or a Unit message in case the message is a SOURCE_FINAL or REPLY_FINAL.

• The RoutingContext is a message decorator containing information for routing the concerned message. It contains the routing tree number, used like a tag for forwarding in the router.

Message          Type               Id
ROUTING_CONTEXT  Message Decorator  0x1d
REPLY_CONTEXT    Message Decorator  0x1e
ATTACHMENT       Message Decorator  0x1f

Table 3: Zenoh message decorators.
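The layout described above (a message id in the five least significant bits of the header byte, three flag bits on top) can be illustrated with a small decoder built from Tables 1 to 3. This is only a sketch of the first header byte; real Zenoh framing also includes the length prefix and the message body, and which flag bits apply depends on the message type.

```python
# Map the ids of Tables 1 and 3 to message names.
ZENOH_IDS = {
    0x0b: "DECLARE", 0x0c: "DATA", 0x0d: "QUERY", 0x0e: "PULL",
    0x0f: "UNIT", 0x10: "LINK_STATE_LIST",
    0x1d: "ROUTING_CONTEXT", 0x1e: "REPLY_CONTEXT", 0x1f: "ATTACHMENT",
}

def decode_header(byte: int):
    """Split a header byte into (message name, flag bits)."""
    msg_id = byte & 0x1f   # five least significant bits: the message id
    flags = byte & 0xe0    # three most significant bits: the flags
    return ZENOH_IDS.get(msg_id, "UNKNOWN"), flags

# A DATA message (0x0c) with the I flag set (0x40, DataInfo present):
print(decode_header(0x4c))  # -> ('DATA', 64)
```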

B.2 Session Messages

Session messages are those used in the neighbour discovery phase (scouting), the session establishment phase between two Zenoh nodes, session maintenance and finally session closure.


Message     Type               Id
SCOUT       Session Message    0x01
HELLO       Session Message    0x02
INIT        Session Message    0x03
OPEN        Session Message    0x04
CLOSE       Session Message    0x05
SYNC        Session Message    0x06
ACK_NACK    Session Message    0x07
KEEP_ALIVE  Session Message    0x08
PING_PONG   Session Message    0x09
FRAME       Session Message    0x0a
ATTACHMENT  Message Decorator  0x1f

Table 4: Session messages.

Like Zenoh messages, session messages can be embellished with flags to indicate to the recipient what additional information the message carries.


Flag  Flag Type  Value  Name           Meaning
A     Session    0x20   Ack            The message is an acknowledgement
C     Session    0x40   Count          The number of unacknowledged messages is present
E     Session    0x40   End            It is the last FRAME message
F     Session    0x80   Fragment       The FRAME is a fragment
I     Session    0x20   PeerID         The PeerID is present
K     Session    0x40   CloseLink      Close the transport link only
L     Session    0x80   Locator        The locators are present
M     Session    0x20   Mask           A mask is present
P     Session    0x20   PingOrPong     The message is a Ping, otherwise a Pong
R     Session    0x20   Reliable       It concerns the reliable channel, best-effort otherwise
S     Session    0x40   SN Resolution  The SN (Sequence Number) resolution is present
T     Session    0x40   TimeRes        The time resolution is in seconds
X     -          0      -              Unused flags are set to zero

Table 5: Available flags for Zenoh session messages.
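As an illustration of Tables 4 and 5, the header byte of a reliable, final FRAME fragment can be composed and checked as follows. Again, this sketches only the first header byte, with the same bit layout as Zenoh messages (flags in the three most significant bits).

```python
FRAME = 0x0a   # Session message id (Table 4)
FLAG_R = 0x20  # Reliable channel (Table 5)
FLAG_E = 0x40  # Last FRAME message
FLAG_F = 0x80  # The FRAME is a fragment

# A reliable, final fragment: combine the id with the three flag bits.
header = FRAME | FLAG_F | FLAG_E | FLAG_R
print(hex(header))            # -> 0xea

assert header & 0x1f == FRAME  # the five low bits still give the id
assert header & FLAG_F         # the fragment bit is set
```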
