DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Fault-Tolerant Cloud Services
Supervision system that ensures the safety of running processes

KEHAN MU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Fault-Tolerant Cloud Services

KEHAN MU

Master in Embedded Software
Date: September 18, 2019
Supervisor: Vinay Yadhav
Examiner: Elena Dubrova
School of Electrical Engineering and Computer Science
Host company: Ericsson
Swedish title: Feltoleranta molntjänster


Abstract

Nowadays, due to the convenience of deployment, ease of scaling up and cost savings, the application of cloud computing systems has spread across factories, companies and individual users. However, fault tolerance in cloud computing systems has always been an important topic, due to the high failure rate caused by the sheer size of cloud computing systems. This thesis presents an implementation of a fault-tolerant system called a "supervision system" as a fault-tolerant mechanism for cloud computing systems. We first propose a supervisor-worker relation: a supervisor node is responsible for monitoring its child (a worker or another supervisor), and the worker node, which does the actual work, periodically resets a timer in its supervisor. If the corresponding timer overflows, the supervisor marks the child as failed and tries to restore it or restart a new instance of it. The system also supports a multi-watchdog mode, which uses more fine-grained watchdogs that group the threads in the worker and apply different strategies to the groups. Besides the local system, we also implemented a remote supervision system to ensure the safety of local root supervisors, by periodically saving their running state and uploading the image files to a remote supervisor. If an overflow occurs, the remote supervisor remotely calls the restore function on the local machine. The restore function then gets the most recent image files from the remote supervisor and restores the local supervisor. In addition to the implementation details of the system, we designed several test cases and tested the speed of each system part. According to the results, we can conclude that the system works as expected.

Sammanfattning

Idag används molnberäkningssystem över hela fabriken, kommersiella och enskilda användare på grund av enkel installation, enkel expansion och kostnadsbesparingar. Feltolerans i molnberäkningssystem har dock alltid varit ett viktigt ämne, eftersom den stora storleken på molnberäkningssystem har lett till höga felfrekvenser. Detta dokument introducerar implementeringen av ett feltolerant system som kallas ett "övervakat system" som en feltolerant mekanism för molnberäkningssystem. Vi föreslår först ett förhållande mellan arbetsledare och arbetare: en handledarnod ansvarar för att övervaka sina barn (personal eller annan handledare), och arbetsnoden som utför det verkliga arbetet återställer periodvis timern i sin handledare. Om motsvarande timer går över, markerar handledaren den som misslyckad och försöker återuppta eller starta om sin nya instans. Systemet stöder också ett flermonitorläge som använder finare skärmar som grupperar trådar i arbetaren och tillämpar olika policyer för gruppen. Förutom det lokala systemet har vi implementerat ett fjärrövervakningssystem för att säkerställa den lokala rotadministratörens säkerhet genom att regelbundet spara körstatus och ladda upp bildfiler till fjärrmonitorn. Om ett överflöd inträffar kommer den fjärrhypervisaren att ringa fjärråterställningsfunktionen på den lokala maskinen. Återställningsfunktionen tar sedan den senaste bildfilen från fjärrkontrollen och återställer sig själv. Förutom systemets implementeringsdetaljer, designade vi också flera testfall och testade hastigheten för varje del av systemet. Baserat på resultaten kan vi dra slutsatsen att systemet fungerar som förväntat.

Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goals
1.5 Literature review
1.6 Ethical issues, sustainability and social issues
1.6.1 Ethical issues
1.6.2 Sustainability and social issues
1.7 Delimitation
1.8 Structure of the thesis

2 Process restoration using CRIU
2.1 Information gathering
2.1.1 Gathering information of process tree
2.1.2 Collection of id information of the process tree
2.1.3 Operation on the network
2.1.4 Information gathering of namespaces
2.2 Backup
2.2.1 Process tree node backup
2.2.2 Backup steps of memory pages
2.3 Cleanup and restore

3 Remote Procedure Call
3.1 TCP/IP
3.2 Principle and operation process of RPC
3.3 Generate RPC frame

4 Local Supervision Tree
4.1 Nodes and relations
4.2 Creation of new child processes
4.3 Watchdog timer mechanism
4.4 Thread group and restart strategies
4.5 Internal structure of supervisor and child
4.6 Program execution flow
4.7 Local supervision tree and supervision forest

5 Remote Supervisor
5.1 Reasons for using RPC
5.2 Execution procedure of the remote supervisor
5.3 Optimization

6 Tests and measurements
6.1 Test cases
6.1.1 Testing the local supervision tree
6.1.2 Testing remote supervisor
6.2 Performance measurements
6.2.1 Time required to start a local supervisor
6.2.2 Time required to start a local child node
6.2.3 Time required for dumping, restoring and restarting
6.2.4 Analysis of the time required for remote supervising

7 Future works

Bibliography

Notations

C/S Client/Server

CPU Central Processing Unit

CRIU Checkpoint/Restore In Userspace

FT Fault Tolerant

HTTP Hypertext Transfer Protocol

IP Internet Protocol

RPC Remote Procedure Call

SMTP Simple Mail Transfer Protocol

TCP Transmission Control Protocol

VLSI Very Large Scale Integration

VM Virtual Machine

WDT WatchDog Timer

Chapter 1

Introduction

In the modern era, it seems that everything is happening in the "cloud": migrating to the cloud, running in the cloud, storing in the cloud, and accessing from the cloud. Simply put, the cloud is the other end of the Internet connection. People can access a variety of applications and services from the cloud, as well as store data securely. The "cloud" is so powerful for three reasons. Firstly, people do not need to maintain or manage the cloud. Secondly, the cloud can be expanded almost without limit, so people do not need to worry about cloud capacity. Lastly, people can access cloud-based services anytime and anywhere. With the variety of applications and services provided, the only thing people need is a device with an Internet connection; with a cloud app, people can open a browser and log in to get started. Technically, a cloud computing system is a typical type of distributed system. The term "distributed" means that computing units can be deployed in different geographic locations. Cloud computing provides IT resources, such as computing power, database storage and applications, on demand over the Internet, using a pay-per-use pricing model. The first advantage of such a system is to improve the utilization rate of IT resources. Moreover, for business and industrial use, companies can buy the exact amount of computational power they need. Consumers can get the computing resources they need (e.g. CPU time, cloud storage, software services) in a self-service manner, anytime, anywhere, and without the need for manual interaction [1]. However, computing systems that consist of a large number of hardware and software components will eventually fail [2]. Therefore, in addition to the technical difficulties of coordination, a fault tolerance mechanism is crucial for the cloud system.


1.1 Background

When a failure occurs inside the system, we need to use fault-tolerance technologies to eliminate the impact of the fault on the system's function [3]. According to their timeliness, faults can be classified into the following three types: permanent faults, intermittent faults, and accidental faults [4]. A permanent fault lasts forever unless repaired. For hardware, a permanent fault means an irreversible physical variation; for software, this type of fault is an error state that cannot be recovered from automatically. An intermittent fault is short-lived but recurring; such faults are both accidental and irregular. Accidental faults are transient and may be non-repetitive [5]. They are often caused by environmental changes, power supply interference, fluctuations in component performance, random software changes, electromagnetic interference and other factors. This type of fault may occur only once in a long time but can result in data errors or even system failures. The use of fault-tolerant methods depends on the specific situation.

A fault-tolerant system automatically detects and diagnoses system faults and then adopts a strategy for controlling or handling the faults. According to the failure response phase of the system, there are three types of fault-tolerant schemes: fault detection, static redundancy, and dynamic redundancy. Fault detection does not provide tolerance for faults but gives a warning when a fault occurs. It is widely used in microsystems such as micro-computers and micro-controllers, which have applied lightweight on-line detection mechanisms [6]. Strictly speaking, fault detection is not fault tolerance: although it detects faults, it cannot tolerate them; it can only give fault warnings. Static redundancy is used in error correction code memory or in systems such as majority-voting redundant computers with a fixed configuration (i.e., the logical connections between the devices remain the same).

With the rapid development of computer hardware and networks, the system overhead of fault-tolerant computers is decreasing, and the speed of error correction is gradually accelerating [7]. Software-based fault tolerance does not place high requirements on the hardware; on the contrary, the system is flexible and resource utilization is reasonable. Artificial intelligence will be used in the detection and diagnosis of failures, and various intelligent tools such as expert systems will also support fault detection and diagnosis. With these, people can use expert knowledge and reasoning engines to provide diagnostic results quickly and accurately. Dynamic reconstruction of the system, fault recovery and neuron chips will be used for fault-tolerant technology and will be implemented with AI support.

At the same time, the internal self-test and self-reconfiguration of circuits can solve the reliability problem of the circuit itself and of the subsystem. There will be fault-tolerant VLSI chips and fault-tolerant design chips that directly support the system's fault-tolerant design, providing system designers with fault-tolerant design components that are transparent. Research on on-chip fault tolerance technology is now a major branch of fault tolerance research.

1.2 Problem

What kind of fault-tolerant scheme can be used in cloud computing systems to assure both the safety of running applications and the safety of the supervision system itself?

1.3 Purpose

The purpose of this project is to implement a library or framework that can be used to provide a fault-tolerant mechanism for processes in cloud systems. Such a framework should be able to provide a type of process as a supervisor that can monitor and recover other processes.

1.4 Goals

The goal of this project is to develop a software framework. It is divided into the following three sub-goals:

Sub-goal 1: Building a local supervision system. Implementing the basic functions of the supervisor mechanism: a) linking between a supervisor and a worker; b) the supervisor's ability to detect faults in, and restart, workers.

Sub-goal 2: Building a remote supervision system. Since a local machine can fail by itself, the second goal is to build a remote supervisor mechanism to assure the safety of local machines.

Sub-goal 3: Refinement and optimization. After finishing the main functions of the system, we pursue optimizations to boost the performance of the system.

1.5 Literature review

Through the literature study, we found that the basic fault-tolerance techniques in cloud services are redundancy and checkpointing.

Checkpointing strategies can be divided into disk-based and disk-less checkpointing. The disk-based approach has an obvious performance bottleneck because of its slower storage access, high checkpoint overhead and slower restart compared to the disk-less checkpoint. The main research problem of the disk-less checkpoint is the trade-off between the sampling rate and performance degradation. A basic strategy is to set a constant checkpoint frequency according to how much slack each task has, such that we can add a maximum number of checkpoints while assuring no additional performance degradation. An example of the disk-less checkpoint is multilevel disk-less checkpointing (Hakkarinen and Chen, 2013), which can recover from N simultaneous failures by checkpoint recursion [8].

Checkpoint strategies are mainly used in tightly coupled computing applications, because in such situations the computing units are highly related and a local failure can result in a global failure, which is unbearable. BlobCR (Nicolae and Cappello, 2011) is an appropriate scheme for tightly coupled scientific applications written using the Message Passing Interface, porting the checkpoint images to IaaS clouds [9]. In BlobCR, all components of the process are replicated.

In recent years, more articles about adaptive fault-tolerant techniques have come out. These schemes are based on combinations and variations of replication and checkpointing. The basic idea is to maintain and improve the safety of systems by adapting to environmental changes. For real-time tasks, the fault tolerance mechanism means that the program has several different instances, and each instance is executed by a different task scheduling algorithm (Malik et al., 2011). In this model, the system keeps several different real-time instances of the same function [10], and the final decision on execution is made based on so-called reliability. When tasks are finished on time, the reliability increases, and vice versa. If the reliability falls under a specific threshold, the VM is replaced by a new VM or recovered by a backward recovery method. The Dynamic Adaptive FT Strategy (Sun et al., 2013) searches for a mathematical relation between failure rates and the basic techniques (checkpoints and replications) [11].

1.6 Ethical issues, sustainability and social issues

1.6.1 Ethical issues

The main ethical problem concerns online privacy rights, although the legal profession has not yet formed a unified view. Some scholars believe that online privacy refers to the personal information of citizens on the Internet: the private space and the peace of one's network life are protected by law, and it is prohibited to illegally obtain, invade, spread or exploit others' private information. With the rapid development of science and technology, people's daily life is increasingly dependent on the network. The network brings great convenience to people's lives, while it also challenges the protection of privacy. The openness, virtuality, interactivity, and anonymity of the network environment make the usual privacy protection methods ineffective there. There are more and more acts of disclosure and dissemination of the privacy of others through the Internet (such as "Internet mass hunting", which causes widespread concern and directly exposes personal privacy). The reasons for these behaviours vary: some are for the public interest, some for the commercial value of online privacy driven by economic interests, and some for the self-satisfaction of the individual, and so on. Compared with ordinary privacy rights, there are many ways to infringe on privacy on the Internet, including excessive collection of personal information, illegal access to private information on the Internet, illegal use of private data on the Internet, illegal disclosure of private data and illegal transactions. Online privacy has its unique features, which can be summarised as diversification of infringement forms, diversification of infringement subjects, expansion and objectification of objects, the dual nature of infringement objects, and the intelligence, concealment, seriousness and complexity of the means of infringement. Users generally do not care much about online privacy information, caring only about ID card numbers and bank cards. In this case, violations of online privacy are even more serious.

1.6.2 Sustainability and social issues

Since the beginning of humanity, the most critical work of people has been to obtain the materials needed to sustain life, and this has not changed until today. These life-sustaining materials mainly come from agriculture, so the agricultural revolution is the basis for promoting social development.

The industrial revolution has increased the level of social productivity through the continuous invention of tools, and promoted the emergence of a new agricultural revolution, enabling people to produce more of the necessities of life. However, the industrial revolution has also brought about a severe social crisis. More and more material wealth is concentrated in the hands of fewer and fewer rich people, and most people are insecure in their living environment. The growing gap between the rich and the poor further deprives the poor of their right to live in dignity. Human greed causes the rich to plunder wealth without any concern, leading to more severe crises such as energy crises, environmental crises, and economic crises. Cloud computing can significantly improve the efficiency of resource utilisation. By redistributing the wealth of information, the gap in material wealth is narrowed, and this will lead to significant changes in people's thinking and a revolution in the underlying technology. Perhaps the essence of the cloud computing revolution is to solve the various crises brought about by the industrial revolution, to rebuild a truly harmonious and peaceful society through the new industrial revolution and the agricultural revolution, and to let plants, animals, people and the environment live in harmony and peace.

1.7 Delimitation

This project aims to propose a fault-tolerant software framework for cloud computing, so we do not consider hardware approaches.

1.8 Structure of the thesis

Chapter 1 of this report begins with a background and introduction to help readers understand and review the relevant areas, and then reviews the significance of the project, related ethical issues and related work. Chapter 2 introduces the process restoring tool CRIU, which is critical in this project, and describes how it stores the running state of a process and how to recover from it. Chapter 3 describes the principle of the remote procedure call and the TCP/IP it depends on; TCP/IP is the transport protocol for the remote supervisor recovery process. Chapter 4 is about the structure of the local supervisor and how it works, and proposes the concept of a supervision tree. Chapter 5 is about the principle and operation process of the remote supervisor. Chapter 6 shows the test cases and the measurement of its functions. Chapter 7 discusses potential improvements in the future.

Chapter 2

Process restoration using CRIU

CRIU is a tool for the Linux platform to perform checkpoint/restore functions in userspace. With this tool, we can freeze the whole running application or part of it, and save the execution status of the application on the disk as a set of image files. These image files can then be used to restore the application from the frozen point in time and let it continue to run. With this software, we can perform live migration, application snapshots, and remote debugging [12]. The most notable feature of CRIU is that it performs checkpoint/restore in userspace, without the need to modify the application or the kernel.

CRIU saves the program state through checkpoints. The checkpoint mainly depends on the /proc file system, because in a Linux system the information about running processes is all stored in /proc. The process dumper mainly performs the following tasks during the checkpoint phase.

2.1 Information gathering

CRIU can get the $pid of a process group leader using the --tree option in the terminal [12]. The dumping process then traverses /proc/$pid/task/ to gather the essential information about the threads to be frozen, and scans /proc/$pid/task/$tid/children to recursively collect the information of the children. The main steps are as follows:

a) During the information collection process:
- Collection of the process tree
- Collection of process tree ids
- Lock operation on the network
- Collection of namespace information


b) During the information backup process:
- Backup of each tree node
- Backup of the mnt_namespace namespace
- Backup of file lock information
- Backup of the process tree according to the root node
- Backup of CGroup information
- Backup of shared memory

c) Writing to the image files

2.1.1 Gathering information of process tree

This part is about the collection of process tree information using the function collect_pstree. To freeze the CGroup, the freeze operation is performed on all processes under the PID of the CGroup. Here compel_interrupt_task only interrupts the process specified by the PID, and then compel_wait_task waits for the process to return status information. After that, collect_task collects information about all threads and child processes under the process tree and freezes them. Finally, CRIU waits for the processes frozen by freeze_processes to return, and all the process tree information is collected.

a) After successfully writing the status to the status files, the system interrupts the entire process tree under the CGroup.
b) CRIU interrupts the CGroup iteratively, gets control of the specified process from the external process and performs an interrupt operation.
c) It is forced to wait for the task to resume from the interrupt, until the signal value is returned according to the status information returned by the child.
d) Collect information about all child processes and threads under the parent process based on the information from the root node and perform an interrupt-freeze operation on them.
e) Collect thread information: first, collect the thread information from the directory /proc/$pid/task, then freeze the processes sequentially according to the thread information.
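As an illustration of this traversal (not CRIU's actual code), a minimal C sketch that walks /proc/$pid/task/ and /proc/$pid/task/$tid/children could look as follows; the output format is arbitrary:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Recursively list the threads of 'pid' and descend into its children,
 * mirroring the traversal described above. */
static void collect_tree(pid_t pid, int depth)
{
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
        DIR *d = opendir(path);
        if (!d)
                return;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
                if (e->d_name[0] == '.')
                        continue;                /* skip "." and ".." */
                int tid = atoi(e->d_name);
                printf("%*spid %d, thread %d\n", depth * 2, "", (int)pid, tid);

                char cpath[96];
                snprintf(cpath, sizeof(cpath),
                         "/proc/%d/task/%d/children", (int)pid, tid);
                FILE *f = fopen(cpath, "r");
                if (!f)
                        continue;
                int child;
                while (fscanf(f, "%d", &child) == 1)
                        collect_tree((pid_t)child, depth + 1);
                fclose(f);
        }
        closedir(d);
}

int main(int argc, char **argv)
{
        if (argc > 1)
                collect_tree((pid_t)atoi(argv[1]), 0);
        return 0;
}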

2.1.2 Collection of id information of the process tree

a) Traverse each process tree node, obtain the id information of the backup task and back up the namespace for each tree node.

b) Back up the id information of sub-objects related to the task.
c) Get the root of the red-black tree, find the appropriate insertion position of the node and generate the id.
d) Back up the namespace id information; the id information of each part gets its id.
e) Generate a namespace if supported. First, determine the type of namespace if the namespace information exists. The PID of the namespace must be the same as the root node of the process tree; otherwise, it will report an error: a nested namespace cannot be backed up.

2.1.3 Operation on the network

a) First write the configuration file of the IP routing table, then switch to the specified net namespace, create a new pipe pfd and write the configuration file to pfd[1]. Then fork a new child process and set the signal mask. After that, the pipe output pfd[0] is redirected to the standard input, and the iptables-restore command is executed in the child process to set the routing table according to the configuration file, which achieves the purpose of the network lock.
b) Switch the namespace: open the specified namespace file to get the file descriptor, and then set the specified namespace according to the descriptor.
c) Restore the IP routing table: create a new pipe, write the information of the configuration file conf, fork a new child process through a system command line in the specified userspace, redirect the pipe read end pfd[0] to the standard input, and then restore the IP routing table according to the conf configuration file.

2.1.4 Information gathering of namespaces

a) Collect user namespaces for the dump.
b) Collect information from mnt_namespace: through a series of calls to the information parsing function, the parsed data is obtained from /proc/mount_info and populated into the mounted information structure, and the filled structure is appended to the mount information global list mntinfo.
c) Information collection for network namespaces.


Figure 2.1: Lock operation on the network

2.2 Backup

2.2.1 Process tree node backup

a) Infect the process to obtain a parasitic control unit: force the infection of the specified PID process and cause it to start daemon mode to accept messages received via the socket.
b) Implementation of memory file share mapping: create a memory file descriptor in the infected process and name it CRIUMFD, then map the memory file descriptor into the memory space of the infected process. At the same time, the memory file descriptor is mapped into local memory, which makes the memory of the infected process mapped with the local memory too, so that memory changes of the infected process are detectable; this means the parasitic source is successfully installed.
c) The parasitic process starts the background service mode: first bind the socket and listen to it, then start the handler that deals with exceptions of the parasitic process in the child process, and wait for messages received from the specified socket.

2.2.2 Backup steps of memory pages

a) Preparation phase: initialization of the page cache and creation of the transmission unit structure.
b) Memory record: all virtual memory blocks correspond to the ppb->iov pipe page buffer unit in the pipe object, and the uniformly managed memory pages in pp->iovs include dirty pages, holes and regular pages.
c) Pipeline transmission: the memory pages in the buffer manager pp->iovs are respectively written to the write end ppb->p[1] of each buffer pipe.
d) Package all memory pages into images.
e) Reset the dirty page bit: reset the dirty bit by writing 4 to the clear_refs file.

2.3 Cleanup and restore

a) Cleanup: as all the items described in the last two sections have been dumped, CRIU uses the ptrace tool to cure the dumpee by removing all parasite code and restoring the original code. CRIU then detaches, and the dumpee continues to run.
b) Resolve shared resources: after cleaning up, CRIU analyzes the image files to obtain the set of sharing relationships between processes and resources. The shared resources are then re-collected, and all other resources are inherited or otherwise acquired in the second phase.
c) Fork the process tree and restore basic task resources: in this step, CRIU calls the function fork() multiple times to recreate the processes dumped in the steps above. Then CRIU restores all resources except the exact location of memory mappings, timers, credentials, and threads; the recovery of these resources is postponed. At this stage, CRIU opens the related files, prepares three kinds of namespaces, maps the private memory areas and fills them with data, creates the dumped sockets, calls chdir() and chroot(), and performs some other operations.

Chapter 3

Remote Procedure Call

In this thesis work, the remote procedure call is used for remotely calling the restoring procedure, so that when a remote supervisor detects a failure of the local machine, it can recover it using RPC.

3.1 TCP/IP

The Internet Protocol suite contains hundreds of protocol standards, but the two most important protocols are TCP and IP, so the Internet protocol suite is often referred to as TCP/IP. It is the suite of protocols used on the Internet and is nowadays used in most home and business networks [13]. When communicating, both parties must know each other's address, just like when sending an email one must know the other's email address. The unique identifier for each computer on the Internet is the IP address, such as 123.123.123.123. If a computer accesses two or more networks at the same time, such as a router, it will have two or more IP addresses. Therefore, an IP address corresponds to a network interface of the computer.

The IP protocol is for sending data from one computer to another via the network. The data is divided into small pieces and then sent out as IP packets. Due to the complexity of Internet links, there are often multiple lines between two computers; therefore, routers are responsible for deciding how to forward an IP packet. IP packets are sent in blocks and may take multiple routes, but neither delivery nor the order of arrival is guaranteed.

An IPv4 address is a 32-bit integer. However, it is often represented as a string, such as 192.168.0.1, which is a representation of the 32-bit integer grouped by 8 bits for easy reading.


From the composition of the IP address, we can see that the IP address space is limited, so it will eventually be exhausted. As IPv4 address resources become increasingly scarce, computer network experts have proposed the IPv6 protocol, hoping to alleviate this problem by expanding the number of address bits. An IPv6 address is a 128-bit integer, an expanded version of the IPv4 address currently in use, represented by a string similar to 2001:0db8:85a3:0042:1000:8a2e:0370:7334.

The TCP protocol is built on top of the IP protocol. The TCP protocol establishes a reliable connection between two computers to ensure that data packets arrive in order. The TCP protocol establishes a connection by a set of handshakes and then numbers each IP packet to ensure the other party receives them in order; if a packet is lost, it is automatically resent. Many commonly used higher-level protocols are based on the TCP protocol, such as the HTTP protocol for browsers and the SMTP protocol for sending mail. In addition to the data to be transmitted, a TCP packet contains the source and destination IP addresses and the source and destination ports.

3.2 Principle and operation process of RPC

RPC is the abbreviation of Remote Procedure Call. Birrell and Nelson's paper published in 'ACM Transactions on Computer Systems' in 1984 gave a classic interpretation of RPC. RPC refers to a procedure on the caller computer calling another procedure on the callee computer. The calling procedure on the caller is suspended, and the called procedure on the callee computer starts executing. When the value is returned to the caller, the calling procedure resumes [14]. The caller can pass information to the callee through parameters, and information can then be obtained from the returned result. What's more, this process is transparent to developers.

The remote procedure call frame uses the client/server (C/S) mode. The C/S mode is known as the client and server architecture. The client-server model is designed to facilitate the sharing of information between the two ends of the communication. It allows a large number of users to simultaneously access information from the database [15]. It is a software system architecture through which one can take full advantage of the hardware environment at both ends and distribute tasks between client and server to reduce communication overhead. In this case, the requester is a client, while the service provider acts as a server.

Figure 3.1: Remote procedure call process

When the request is received by the server, the operating system on the server passes it to a so-called server stub. The server stub corresponds to a client stub on the server side and is a block of code that translates incoming requests into local procedure calls. Typically, a server stub calls the function 'receive' first, then blocks itself and waits for a message to arrive. After receiving the message, the server parses the parameters from the received message and invokes the corresponding procedure on the server in a normal way. From the server's perspective, the procedure seems to be called directly by the client: all the function call information is on the stack. The server executes the required functions and then returns the results in a conventional manner. In the end, control returns to the server stub, which packages the result (buffer) into a message and then calls the function 'send' to return the result to the client. After that, the server stub calls the function 'receive' again to

get ready for the next input request. After the client machine has received the message, the operating system on the client is informed that the message belongs to a client procedure (the process is actually a client stub, but the operating system is not able to distinguish between the two). The operating system copies the message into the appropriate cache and then unblocks the client process. The client stub checks the message, extracts the return value and copies it to the caller, and then returns it in the usual way. When the caller regains control after the call has finished, all it sees is that it now has the required data; whether the operation was done on the local operating system or remotely is invisible to it. Throughout the method, the client can ignore content that is not of interest. The operation performed by the client is simply a regular (local) procedure call to access the remote service; it does not need to call the function 'send' or the function 'receive' directly. The details of the message passing are encapsulated in the library procedures of both sides, just as a traditional library hides the details of executing the actual system call.

The advantage of RPC is that the interaction mode is simple and easy to use, because the service is provided as an interface, and the interaction protocol between client and server is easy to unify. Many mature companies maintain their own RPC frameworks, such as Baidu's sofa-pbRPC and Google's gRPC. Most companies can use an RPC framework to generate all interface packing and unpacking code; users only need to implement the functions. Using an RPC framework is simple: only a proto file that describes the protocol interaction on both sides is needed, because the description file (proto file) is enough to keep both sides consistent. RPC is also very convenient to test. Most RPC frameworks are cross-language, so we can write test programs in a more convenient scripting language (such as Python) to simulate interaction with C/C++ programs.

3.3 Generate RPC frame

Rpcgen is a compiler that allows people to easily write RPC programs and automatically generates interface code for network connections, eliminating the hassle of handwriting this code. So it can be regarded as an automatic code generation tool for RPC. The basic steps of generating an RPC framework using rpcgen are as follows:

a) Run the following commands in a terminal:
» mkdir rpcroutine
» cd rpcroutine
» vi rpc.x

b) Type the following code in rpc.x:
program RPCFRAME {
    version VERSION {
        string RPCTEST(string) = 1;
    } = 1;
} = 12345678;

c) Generate code using rpcgen:
» rpcgen rpc.x
and get the following files: rpc_clnt.c rpc.h rpc_svc.c

d) Generate rpc_clnt_func.c:
» rpcgen -Sc -o rpc_clnt_func.c rpc.x

e) Generate rpc_srv_func.c:
» rpcgen -Ss -o rpc_srv_func.c rpc.x

f) Compile the server code:
» gcc -Wall -o rpc_server rpc_srv_func.c rpc_svc.c

g) Compile the client code:
» gcc -Wall -o rpc_client rpc_clnt_func.c rpc_clnt.c

h) Start the server:
» ./rpc_server

i) Start the client:
» ./rpc_client 127.0.0.1
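As an illustration, the generated skeletons might be filled in as sketched below. The function names rpctest_1_svc and rpctest_1 and the constants RPCFRAME and VERSION follow the usual rpcgen naming for the interface above; the actual restore logic is only hinted at by a placeholder reply.

/* rpc_srv_func.c (sketch): the service routine filled in */
#include <stdlib.h>
#include <string.h>
#include "rpc.h"                      /* generated by rpcgen from rpc.x */

char **
rpctest_1_svc(char **argp, struct svc_req *rqstp)
{
        static char *result;          /* static: the stub reads it after return */

        /* *argp carries the string sent by the client, e.g. a path to the
         * image files; a real implementation would run the restore here. */
        free(result);
        result = strdup("restore triggered");
        return &result;
}

/* rpc_clnt_func.c (sketch): calling the remote function from the client */
#include <stdio.h>
#include "rpc.h"

int call_remote(char *host, char *arg)
{
        CLIENT *clnt = clnt_create(host, RPCFRAME, VERSION, "tcp");
        if (clnt == NULL)
                return -1;
        char **res = rpctest_1(&arg, clnt);   /* blocks until the server replies */
        if (res != NULL && *res != NULL)
                printf("server replied: %s\n", *res);
        clnt_destroy(clnt);
        return res ? 0 : -1;
}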

Chapter 4

Local Supervision Tree

In this supervision system, we have two types of nodes: supervisor and worker. The supervisor is a node responsible for starting, cancelling, and monitoring the running state of its child. The worker is a type of node that runs a process that does the actual work. The supervisor is designed to deal with failures of workers, but a supervisor itself can fail too. So, we propose a so-called local supervision tree system to ensure each supervisor is supervised.

4.1 Nodes and relations


Figure 4.1: An example of supervision tree

Fig 4.1 shows a typical supervision tree. In this figure, S is the abbreviation of Supervisor, and W is for Worker. In a supervision tree, only supervisors can be father nodes, and the number of children a supervisor can have is unlimited.


Leaf nodes can either be supervisors or workers (although a leaf supervisor is meaningless).

4.2 Creation of new child processes

Through the system call 'fork', we can create a new process with the same running stack as the current process. The child process inherits the entire address space of the parent process, including the process context, stack address, memory information and process control block (PCB). We usually refer to the new process as the child process and the current process as the parent process. According to the Linux C process creation mechanism, the father node first copies itself with the fork function and then uses the execl function to cover this copy and execute the target script (see Figure 4.2; a minimal sketch follows the figure).


Figure 4.2: Child creating process.
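A minimal sketch of this create-and-cover pattern is shown below; the script path is a placeholder:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a child that replaces itself with the worker script. */
pid_t start_child(const char *script)
{
        pid_t pid = fork();            /* duplicate the calling process     */
        if (pid == 0) {
                /* child: overwrite the copy with the target program */
                execl(script, script, (char *)NULL);
                perror("execl");       /* only reached if execl fails       */
                _exit(1);
        }
        return pid;                    /* parent: remember the child's pid  */
}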

4.3 Watchdog timer mechanism

The watchdog is a counter that can be reset within a certain period of time. When the watchdog starts, the counter starts counting automatically. If the counter has not been reset after a certain period of time, it sends a reset signal when it reaches the specified value.

Such a mechanism can detect whether the process has exited abnormally or is stuck in an infinite loop. Many devices, including the CPU, receive this signal and reset and restart. In order to ensure that the watchdog does not produce reset signals, the watchdog counter needs to be cleared within the time interval allowed by the watchdog, so that the counter starts counting again. If the system works normally and is guaranteed to "feed the dog" on time, then all is fine. Once the program fails, there is no "dog feeding", and the system is "bitten" and reset.


Figure 4.3: Procedure of running watchdog timer

In this project, the heartbeat process works as a watchdog resetter, which periodically sends reset signals to its supervisor (if any). The inter-process communication method used here is a semaphore, which is created after finishing setting up the child node. First, the father node starts a watchdog process with the process id of the child. Then it creates a semaphore, using the child's process id as the id of the semaphore. This method of id assignment makes it easier for the father and child to communicate. Figure 4.3 shows the mechanism of the watchdog timer and the procedure of watchdog reset/overflow: the father creates a semaphore and periodically tries to lock it, while the child periodically posts the semaphore. Once the child is dead or stuck in a loop, it is no longer able to send this reset signal; thus the father fails when trying to wait for the semaphore, and an overflow occurs. A minimal sketch of this handshake is given below.
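The following sketch illustrates the semaphore handshake, assuming POSIX named semaphores whose name is derived from the child's pid; the names and the five-second period are illustrative, not the project's exact values:

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Supervisor side: wait for the child's heartbeat, with a timeout. */
int watch_child(pid_t child)
{
        char name[32];
        snprintf(name, sizeof(name), "/wdt_%d", (int)child);
        sem_t *sem = sem_open(name, O_CREAT, 0600, 0);

        for (;;) {
                struct timespec ts;
                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 5;                   /* watchdog period: 5 s */
                if (sem_timedwait(sem, &ts) != 0)
                        return -1;                /* no heartbeat: WDT overflow */
        }
}

/* Child side: called periodically from the working loop. */
void heartbeat(pid_t self)
{
        char name[32];
        snprintf(name, sizeof(name), "/wdt_%d", (int)self);
        sem_t *sem = sem_open(name, 0);
        if (sem != SEM_FAILED)
                sem_post(sem);                    /* reset the watchdog */
}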

4.4 Thread group and restart strategies

The previous section describes a primary coarse-grained watchdog monitoring method. In practice, a process usually consists of a number of threads, but the importance of these threads often differs. When some of the non-critical threads fail or get stuck in an infinite loop, we want to ignore these errors and let the process continue to run. This requires setting watchdogs for different threads in the same process to detect the exact running state of the process. The implementation goes like this: when a new worker process is created, we no longer create a single watchdog for it, but create a monitor process to manage the watchdogs. The monitor thread communicates with the child process through a message queue. Once a new thread is created in the child process, the child process sends the information that is needed to run a watchdog timer. A single message includes the group number (importance) of the thread and the thread number (specified by the user); this information is used to derive the id of the watchdog for each thread. A sketch of such a registration message follows.
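A hypothetical shape of such a registration message, assuming System V message queues keyed per child, might look as follows (field and constant names are illustrative):

#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Registration message a worker thread sends to the monitor so that a
 * per-thread watchdog can be created for it. */
struct wdt_msg {
        long mtype;          /* required by System V message queues      */
        int  group;          /* CRITICAL or NON-CRITICAL thread group    */
        int  thread_no;      /* user-assigned thread number              */
};

enum { GROUP_CRITICAL = 1, GROUP_NON_CRITICAL = 2 };

/* Worker side: announce a new thread to the monitor. */
int register_thread(key_t key, int group, int thread_no)
{
        int q = msgget(key, 0666 | IPC_CREAT);
        if (q < 0)
                return -1;
        struct wdt_msg m = { .mtype = 1, .group = group, .thread_no = thread_no };
        return msgsnd(q, &m, sizeof(m) - sizeof(long), 0);
}

/* Monitor side: receive registrations and start watchdogs for them. */
int next_registration(key_t key, struct wdt_msg *out)
{
        int q = msgget(key, 0666 | IPC_CREAT);
        if (q < 0)
                return -1;
        return (int)msgrcv(q, out, sizeof(*out) - sizeof(long), 1, 0);
}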


Figure 4.4: Multi-watchdog mechanism

4.5 Internal structure of supervisor and child

Except for the heartbeat part, a supervisor consists of data, functions, threads and interfaces to its children. Figure 4.5 shows the structure of the supervisor and worker.


Figure 4.5: Internal structure of supervisor and worker

The core part of a supervisor is the information about its child processes. Because this information must be created and deleted dynamically, we implement the data part as a unidirectional linked list (a sketch of a list node is given below). Each node of the list stores the information of one child: the process type (supervisor or worker), the script to be executed, the execution state, the process id of the watchdog timer and the maximum number of restarts. The structure of a child is simple: a heartbeat process and a working process that does the actual work. The function part is for manipulating the child list. To begin with, the supervisor adds a new child node to the child list, allocates memory and initializes it. Secondly, a new child process is created by forking, and it starts to send heartbeats. Then, the supervisor starts to accept the heartbeats and monitors whether the child process is running correctly. If it is not, the supervisor uses CRIU to restore the child if it is a supervisor, or restarts it if it is a worker.
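A hypothetical sketch of one node of this child list is shown below; the field names are illustrative, not the project's exact definitions:

#include <sys/types.h>

enum child_type { CHILD_SUPERVISOR, CHILD_WORKER };

/* One entry of the supervisor's unidirectional child list. */
struct child_info {
        enum child_type type;        /* supervisor or worker              */
        char script[256];            /* script/program the child executes */
        int  state;                  /* current execution state           */
        pid_t pid;                   /* pid of the child process          */
        pid_t wdt_pid;               /* pid of its watchdog timer process */
        int  max_restarts;           /* give up after this many restarts  */
        struct child_info *next;     /* next node in the linked list      */
};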

4.6 Program execution flow

Fig 4.6 shows a sequence diagram describing the procedure from start to child restoration. Each vertical lane represents an independent process, and different branches within a lane represent different threads. Fork() means creating a new process. To begin with, the main script starts as a panel process, then forks a root supervisor. The panel process communicates with this root supervisor and any other supervisors via message queues. As the root supervisor starts, it creates a receiver thread to accept instructions from any panel that is connected to it by its process id. Each time the supervisor finishes starting a new child process, it starts a watchdog timer to monitor it. A child process can be another supervisor or a worker. The child process periodically sends WDT reset signals to prove its liveness. If the heartbeat is interrupted, the father automatically recognises this as a failure and terminates the child; then, if the child is a worker, it starts a new instance of it, or if the child is a supervisor, it restores it.

Figure 4.6: Running procedure from start to child restoration

4.7 Local supervision tree and supervision forest

Since supervisors could fail, a supervisor needs another supervisor to monitor it. Thus, we can set up supervision trees to realise it.


Figure 4.7: An example of supervision forest

However, such a supervision tree cannot bear the failure of its root. So, we organise a set of supervision trees and let their roots be supervisors of each other. Fig 4.7 represents a supervision forest in which the root supervisors are organised as a loop. In this loop, only when all of the root supervisors have failed does the local machine become unable to restore itself. Such a mechanism ensures that each of the nodes has its own supervisor.

Chapter 5

Remote Supervisor

In the local supervision tree chapter, we explored the safety of the local processes and the safety of the supervisor processes, but in actual operation, the local machine may also fail. So, in this chapter, we propose a remote monitoring process to meet this requirement.

5.1 Reasons for using RPC

When using a remote procedure call, the information format is transparent. In a native application, to call an object, we need to pass parameters and receive a call result. The caller does not need to care how the parameters are used inside the called object and how the results are returned. For remote calls, these parameters are passed to another computer on the network in some form of message, and the caller does not need to care how this information is structured. There should also be cross-language capability, because the caller does not actually know which language the remote server application is written in. So for the caller, the call should succeed regardless of the language used by the server, and return values should be described in a form understandable by the calling program's language.

5.2 Execution procedure of the remote supervisor

This mechanism is mainly for root supervisors, so the supervisor should be started using the '-r' option. When the root supervisor is successfully initialized, it starts a dumper to periodically checkpoint the root supervisor process and


generate the image files which are needed for the restoration. In CRIU, a dump operation cannot be done if the dumping process itself is within the dumped tree. So to start a dumper, we first fork a new child process, then fork a grandchild process and let the main process of the child exit. By doing this, the dumper process has no direct kinship with its grandparent process, and it can therefore dump its grandparent. A minimal sketch of this pattern is given below.
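A minimal sketch of this double-fork pattern, assuming CRIU is invoked through its command-line interface, could look as follows; the image directory, period and options are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Periodically dump 'target' (e.g. the root supervisor) from a process
 * that is outside the dumped tree: fork a child, let it fork the real
 * dumper and exit, so the dumper is re-parented away from the tree. */
void start_dumper(pid_t target, const char *img_dir, int period_s)
{
        pid_t child = fork();
        if (child != 0) {
                waitpid(child, NULL, 0);   /* reap the intermediate child  */
                return;                    /* supervisor continues its work */
        }

        if (fork() != 0)
                _exit(0);                  /* intermediate child exits at once */

        /* grandchild: no longer a descendant of 'target', so it may dump it */
        for (;;) {
                char cmd[256];
                snprintf(cmd, sizeof(cmd),
                         "criu dump -t %d -D %s --shell-job --leave-running",
                         (int)target, img_dir);
                if (system(cmd) != 0)
                        fprintf(stderr, "criu dump failed\n");
                sleep(period_s);
        }
}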


Figure 5.1: Remote supervisor mechanism

The difference in the root supervisor (compared to child supervisors) is that it possesses a file client thread. Through this client, the root supervisor connects itself to the file server via TCP/IP. The file client thread periodically sends a copy of the compressed image files of the root supervisor (a sketch of such an upload is given below). Each time the server receives an image file, it resets the watchdog timer for the client. Once the watchdog timer overflows, the RPC client that works with the file server starts a request to call the pre-defined function in the RPC server: download the image files from the file server, unzip them and use them to restore the supervisor.
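A minimal sketch of the file client side, assuming a plain TCP connection and a pre-packed image archive (port, paths and framing are illustrative; the real system also compresses the images before sending):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one image archive to the file server; each successful upload
 * doubles as a watchdog reset on the server side. */
int send_image(const char *server_ip, int port, const char *archive)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(port) };
        inet_pton(AF_INET, server_ip, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(fd);
                return -1;
        }

        FILE *f = fopen(archive, "rb");
        if (f) {
                char buf[4096];
                size_t n;
                while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                        if (write(fd, buf, n) < 0)   /* stream to the server */
                                break;
                fclose(f);
        }
        close(fd);                    /* closing marks the end of this upload */
        return 0;
}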

5.3 Optimization

The design details described above show the final design of the system. In this section, we describe the original design and compare it to the current design.


Figure 5.2: Comparison between before and after optimization

The original design is more intuitive but also more complex. Fig 5.2 shows a simplified version of the execution sequence before and after optimization (in order to simplify the representation and highlight the changes, the fact that the file transfer is performed by a separate file server and file client is omitted).

Before optimization:
a) The local supervisor initiates an RPC, targeting the service function in the RPC server.
b) In the first call of the service function, the RPC server starts a watchdog thread to monitor the RPC client. Also, each time the service function is called, the function resets the watchdog timer.
c) The RPC server starts a request to get the image file of the supervisor that is to be monitored.
d) The RPC client sends the image file.
e) If the service function in the RPC server is not called for a preset time interval, a watchdog overflow occurs.
f) The RPC server puts the most recent image file back to the RPC client.
g) The RPC client restores the supervisor.

After optimization: instead of requesting files after resetting the watchdog, the watchdog timer is reset directly when an image file is sent to the server, and the restore function is called over RPC when the watchdog timer overflows.

Chapter 6

Tests and measurements

6.1 Test cases

6.1.1 Testing the local supervision tree


Figure 6.1: Test case 1 - local supervision tree

The setup steps of the test case are as follows: a) Start a root supervisor using the command './sup -r' (the option '-r' denotes a root supervisor). b) Start three child nodes: one child supervisor A and two child workers. c) Start two grandchild nodes under A. The structure of the test case is shown in Fig 6.1.


In this test, termination of any of the child nodes resulted in a successful restart or restoration.

6.1.2 Testing remote supervisor

It takes several steps to establish a remote supervision relation: a) Start the supervisor on the local machine. The script starts a dumper for the main process, an RPC server to receive restoring commands, and a file client to periodically send image files to the file server. b) Start the server, including a file server to receive and store the image files sent from the node being supervised and an RPC client to call the restore function on the local machine.


Figure 6.2: Test case 2 - remote supervisor

In this test, when all the set-up was done, the server began to receive image files and reset the watchdog timer periodically. When we intentionally terminated the local supervisor, the remote supervisor successfully called the restoration function on the local machine, and the local supervisor continued to run.

6.2 Performance measurements

6.2.1 Time required to start a local supervisor

In this section, we present the measurement results of the execution time needed to start a supervisor. In the measurement, we used time.h from the C standard library to obtain the timestamps. Since the scheduling of the operating system changes over time, the running time required for the same program differs between runs. Therefore, in this measurement, we measured the run time of each component ten times at different times and averaged the results.

         supervisor  init variables  para check  fork panel  sup init
1          0.129        0.028           0.001       0.036      0.075
2          0.122        0.022           0.001       0.031      0.073
3          0.137        0.019           0.001       0.033      0.067
4          0.109        0.025           0.002       0.029      0.065
5          0.118        0.022           0.001       0.028      0.069
6          0.119        0.023           0.001       0.042      0.077
7          0.147        0.021           0.002       0.038      0.074
8          0.154        0.019           0.001       0.031      0.068
9          0.121        0.025           0.001       0.032      0.071
10         0.135        0.029           0.001       0.032      0.072
Average    0.1291       0.0233          0.0012      0.0332     0.0711

Table 6.1: Running time of starting a new supervisor and its components (ms)

We can tell from Table 6.1 that the item which takes the most time is supervisor initialization, followed by forking the control panel and initializing variables. The reason is that in the function supervisor_init, we start a receiver thread for the supervisor so that it can receive commands such as starting/terminating child nodes, and we initialize a linked list and use the function malloc() to allocate memory for the list nodes. These operations, which require system calls, are more time-consuming. Forking a control panel is time-consuming as expected, while initializing variables is not, because it only involves declaring large buffers and getting the process id of the process itself.

         supervisor  worker
1          0.375      0.332
2          0.357      0.352
3          0.346      0.364
4          0.348      0.329
5          0.362      0.381
6          0.335      0.366
7          0.434      0.392
8          0.367      0.321
9          0.356      0.362
10         0.329      0.379
Average    0.3609     0.3578

Table 6.2: Time required to start nodes (ms)

6.2.2 Time required to start a local child node

Table 6.2 shows that, although a supervisor node is much more complicated than a worker node, the start time needed is approximately the same, so the time consumption is basically independent of the node type. In both cases the cost is dominated by the system calls used to allocate a new record node, insert it into the linked list and fork a new process.

6.2.3 Time required for dumping, restoring and restarting

In this section, we present the execution times of dumping, restoring and restarting. Since the CRIU commands cannot be used if the CRIU process itself is within the dumped tree [12], we take the following steps to create a new process: first use the function fork() to create a son process; then let the son process fork a grandson process to run the CRIU command; the last step is to terminate the son process. By doing so, the CRIU process is no longer within the process tree, because it is adopted by the init process.

operation  dumping  restoring  restarting
1           0.962      2          0.361
2           1.336      3          0.332
3           1.211      2          0.434
4           0.899      2          0.367
5           1.102      2          0.348
6           1.111      3          0.353
7           1.034      3          0.335
8           0.849      2          0.362
9           1.298      2          0.356
10          0.992      2          0.329
Average     1.0794     2.3        0.3577

Table 6.3: Time required to dump, restore and restart(ms)

6.2.4 Analysis of the time required for remote supervising

The time needed to restore a remotely supervised node consists of two parts: the RTT (round trip time) of the RPC and the restarting time of the target supervisor.

The RTT is determined by three parts: the propagation time of the link, the processing time of the terminal systems, and the queuing and processing time in the router caches [16]. The restarting time is the same as starting a local supervisor. So the RPC restore time is t_r = RTT/2 + time of restoring the local supervisor. Here we measure the RTT by using the Timestamp option of TCP, which can be used to measure the RTT accurately: RTT = current time - the echo time of the Timestamp option in the packet, where the echo time is the time the packet was sent. By measuring the reception time (current time) and transmission time (echo time) of the data packet, we get a measurement of the RTT. The results are shown in Table 6.4.

Items    RTT (ms)  restoring (ms)  RPC restore (ms)
0           20          2               12
1           25          3               15.5
2           23          2               13.5
3           15          2               9.5
4           24          2               14
5           22          3               14
6           19          3               12.5
7           30          2               17
8           24          2               14
9           24          2               14
Average     22.6        2.3             13.6

Table 6.4: RPC restore time (ms)

Chapter 7

Future works

This project successfully built a basic supervision system both locally and remotely. Through several tests, we can see that the main functions of the system work normally under the designed test cases; however, there are still some potential defects. One main problem is that the system does not support concurrent access: if we deploy the system in a case of frequent occurrences of errors, this monitoring system may collapse itself due to inconsistent or corrupted data caused by concurrent access. A possible solution is to use a finer-grained set of mutex locks to protect the data. For example, when doing an insert operation, we only lock the head node, and when a delete operation is being performed, the node to be deleted and its previous node are locked (a sketch of this idea is given below). The main data structure used is a linked list; in the supervisor process, this can be a performance bottleneck. For a larger scale of problems, using a hash table can be a good option. Another issue is that the overhead of checkpoints is still considerable, and checkpoints do not fully guarantee the safety of the system. For example, if the program already has a problem when the last checkpoint is collected (which is likely), then rolling back the running state does not solve the problem. So we need a more robust rollback mechanism: for example, if the program still fails after a rollback, continue to try earlier checkpoints (although this will increase the overhead of the system).
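As an illustration of the fine-grained locking idea (a suggestion, not code from the project), per-node mutexes with hand-over-hand locking could look roughly as follows:

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct child_node {
        pthread_mutex_t lock;        /* one mutex per list node           */
        pid_t pid;
        struct child_node *next;
};

struct child_list {
        pthread_mutex_t head_lock;   /* protects only the head pointer    */
        struct child_node *head;
};

/* Insert at the head: only the head lock is taken. */
void list_insert(struct child_list *l, pid_t pid)
{
        struct child_node *n = calloc(1, sizeof(*n));
        pthread_mutex_init(&n->lock, NULL);
        n->pid = pid;
        pthread_mutex_lock(&l->head_lock);
        n->next = l->head;
        l->head = n;
        pthread_mutex_unlock(&l->head_lock);
}

/* Delete a node: lock the predecessor and the node, not the whole list. */
int list_delete(struct child_list *l, pid_t pid)
{
        pthread_mutex_lock(&l->head_lock);
        struct child_node *cur = l->head;
        if (!cur) {
                pthread_mutex_unlock(&l->head_lock);
                return -1;
        }
        pthread_mutex_lock(&cur->lock);
        if (cur->pid == pid) {                     /* deleting the head node */
                l->head = cur->next;
                pthread_mutex_unlock(&l->head_lock);
                pthread_mutex_unlock(&cur->lock);
                free(cur);
                return 0;
        }
        pthread_mutex_unlock(&l->head_lock);

        struct child_node *prev = cur;
        cur = cur->next;
        while (cur) {
                pthread_mutex_lock(&cur->lock);    /* take next lock before releasing prev */
                if (cur->pid == pid) {
                        prev->next = cur->next;    /* unlink while holding prev and cur */
                        pthread_mutex_unlock(&cur->lock);
                        pthread_mutex_unlock(&prev->lock);
                        free(cur);
                        return 0;
                }
                pthread_mutex_unlock(&prev->lock);
                prev = cur;
                cur = cur->next;
        }
        pthread_mutex_unlock(&prev->lock);
        return -1;
}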

Bibliography

[1] Tharam S. Dillon, Chen Wu, and Elizabeth Chang. “Cloud Computing: Issues and Challenges”. In: 2010 24th IEEE International Conference on Advanced Information Networking and Applications (2010), pp. 27–33.
[2] Flaviu Cristian. “Understanding Fault-Tolerant Distributed Systems”. In: Commun. ACM 34 (1991), pp. 56–78.
[3] X. Xiao-dong. “Research on Multi-thread Parallel Computing Fault-Tolerant Technology”. In: 2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC). 2018, pp. 1384–1387.
[4] Roozbeh Bakhshi, Surya Tej Kunche, and Michael G. Pecht. “Intermittent Failures in Hardware and Software”. In: 2014.
[5] Aakriti Gupta and Shreta Sharma. “Software Maintenance: Challenges and Issues”. In: Issues 1.1 (2015), pp. 23–25.
[6] E. Dwiggins David. “Fault tolerant microcontroller for the configurable Fault Tolerant Processor”. In: (2008).
[7] Trio Adiono, Syifaul Fuada, and Rosmianto Aji Saputro. “Rapid Development of System-on-Chip (SoC) for Network-Enabled Visible Light Communications”. In: International Journal of Recent Contributions from Engineering, Science, and IT (iJES) 6 (Feb. 2018).
[8] Douglas Hakkarinen and Zizhong Chen. “Multilevel Diskless Checkpointing”. In: IEEE Transactions on Computers 62 (2013), pp. 772–783.
[9] Bogdan Nicolae and Franck Cappello. “BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots”. In: Nov. 2011, p. 34.


[10] Sheheryar Malik and Fabrice Huet. “Adaptive Fault Tolerance in Real Time Cloud Computing”. In: 2011 IEEE World Congress on Services (2011), pp. 280–287.
[11] Dawei Sun et al. “Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments”. In: The Journal of Supercomputing 66 (2013), pp. 193–228.
[12] CRIU:About. https://criu.org/CRIU:About/. Accessed July 4, 2019.
[13] Quentin Docter and Jon Buhagiar. “Introduction to TCP/IP”. In: (Apr. 2019), pp. 363–402.
[14] Andrew Birrell and Bruce Jay Nelson. “Implementing Remote Procedure Calls”. In: ACM Trans. Comput. Syst. 2 (1984), pp. 39–59.
[15] Shakirat Sulyman. “Client-Server Model”. In: IOSR Journal of Computer Engineering 16 (Jan. 2014), pp. 57–71.
[16] Jing Wu et al. “A New Sustainable Interchain Design on Transport Layer for Blockchain”. In: Smart Blockchain. Ed. by Meikang Qiu. Cham: Springer International Publishing, 2018, pp. 12–21. isbn: 978-3-030-05764-0.

TRITA-EECS-EX-2019:654

www.kth.se