DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Fault-Tolerant Cloud Services
Supervision system that ensures the safety of running processes

KEHAN MU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Fault-Tolerant Cloud Services

KEHAN MU

Master in Embedded Software
Date: September 18, 2019
Supervisor: Vinay Yadhav
Examiner: Elena Dubrova
School of Electrical Engineering and Computer Science
Host company: Ericsson
Swedish title: Feltoleranta molntjänster


Abstract

Nowadays, due to the convenience of deployment, ease of scaling up and cost savings, the application of cloud computing systems has spread across factories, companies and individual users. However, fault tolerance in cloud computing systems has always been an important topic, due to the high failure rate caused by the sheer size of cloud computing systems. This thesis presents an implementation of a fault-tolerant system called a "supervision system" as a fault-tolerant mechanism for cloud computing systems. We first propose a supervisor-worker relation: a supervisor node is responsible for monitoring its child (a worker or another supervisor), and the worker node, which does the actual work, periodically resets a timer in its supervisor. If the corresponding timer overflows, the supervisor marks the child as failed and tries to restore it or restart a new instance of it. The system also supports a multi-watchdog mode, which uses more fine-grained watchdogs that group the threads in the worker and apply different strategies to the groups. Besides the local system, we also implemented a remote supervision system to ensure the safety of local root supervisors, by periodically saving their running state and uploading the image files to a remote supervisor. If an overflow occurs, the remote supervisor remotely calls the restore function on the local machine. The restore function then gets the most recent image files from the remote supervisor and restores the local supervisor. In addition to the implementation details of the system, we designed several test cases and tested the speed of each system part. According to the results, we can conclude that the system works as expected.

Sammanfattning

Idag används molnberäkningssystem över hela fabriken, kommersiella och enskilda användare på grund av enkel installation, enkel expansion och kostnadsbesparingar. Feltolerans i molnberäkningssystem har dock alltid varit ett viktigt ämne, eftersom den stora storleken på molnberäkningssystem har lett till höga felfrekvenser. Detta dokument introducerar implementeringen av ett feltolerant system som kallas ett "övervakat system" som en feltolerant mekanism för molnberäkningssystem. Vi föreslår först ett förhållande mellan arbetsledare och arbetare: en handledarnod ansvarar för att övervaka sina barn (personal eller annan handledare), och arbetsnoden som utför det verkliga arbetet återställer periodvis timern i sin handledare. Om motsvarande timer går över, markerar handledaren den som misslyckad och försöker återuppta eller starta om sin nya instans. Systemet stöder också ett flermonitorläge som använder finare skärmar som grupperar trådar i arbetaren och tillämpar olika policyer för gruppen. Förutom det lokala systemet har vi implementerat ett fjärrövervakningssystem för att säkerställa den lokala rotadministratörens säkerhet genom att regelbundet spara körstatus och ladda upp bildfiler till fjärrmonitorn. Om ett överflöd inträffar kommer den fjärrhypervisaren att ringa fjärråterställningsfunktionen på den lokala maskinen. Återställningsfunktionen tar sedan den senaste bildfilen från fjärrkontrollen och återställer sig själv. Förutom systemets implementeringsdetaljer, designade vi också flera testfall och testade hastigheten för varje del av systemet. Baserat på resultaten kan vi dra slutsatsen att systemet fungerar som förväntat.

Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goals
1.5 Literature review
1.6 Ethical issues, sustainability and social issues
1.6.1 Ethical issues
1.6.2 Sustainability and social issues
1.7 Delimitation
1.8 Structure of the thesis

2 Process restoration using CRIU
2.1 Information gathering
2.1.1 Gathering information of process tree
2.1.2 Collection of id information of the process tree
2.1.3 Operation on the network
2.1.4 Information gathering of namespaces
2.2 Backup
2.2.1 Process tree node backup
2.2.2 Backup steps of memory pages
2.3 Cleanup and restore

3 Remote Procedure Call
3.1 TCP/IP
3.2 Principle and operation process of RPC
3.3 Generate RPC frame

4 Local Supervision Tree
4.1 Nodes and relations
4.2 Creation of new child processes
4.3 Watchdog timer mechanism
4.4 Thread group and restart strategies
4.5 Internal structure of supervisor and child
4.6 Program execution flow
4.7 Local supervision tree and supervision forest

5 Remote Supervisor
5.1 Reasons for using RPC
5.2 Execution procedure of the remote supervisor
5.3 Optimization

6 Tests and measurements
6.1 Test cases
6.1.1 Testing the local supervision tree
6.1.2 Testing remote supervisor
6.2 Performance measurements
6.2.1 Time required to start a local supervisor
6.2.2 Time required to start a local child node
6.2.3 Time required for dumping, restoring and restarting
6.2.4 Analysis of the time required for remote supervising

7 Future works

Bibliography

Notations

C/S Client/Server

CPU Central Processing Unit

CRIU Checkpoint/Restore In Userspace

FT Fault Tolerant

HTTP Hypertext Transfer Protocol

IP Internet Protocol

RPC Remote Procedure Call

SMTP Simple Mail Transfer Protocol

TCP Transmission Control Protocol

VLSI Very Large Scale Integration

VM Virtual Machine

WDT WatchDog Timer

Chapter 1

Introduction

In the modern era, it seems that everything is happening in the "cloud": migrating to the cloud, running in the cloud, storing in the cloud, and accessing from the cloud. Simply put, the cloud is the other end of the Internet connection. People can access a variety of applications and services from the cloud, as well as store data securely. The "cloud" is so powerful for three reasons. Firstly, people do not need to maintain or manage the cloud. Secondly, the cloud can be expanded almost without limit, so people do not need to worry about cloud capacity. Lastly, people can access cloud-based services anytime and anywhere. With the variety of applications and services provided, the only thing people need is a device with an Internet connection; with a cloud app, people can open a browser and log in to get started. Technically, a cloud computing system is a typical type of distributed system. The term "distributed" means that computing units can be deployed in different geographic locations. Cloud computing provides IT resources, such as computing power, database storage and applications, on demand over the Internet, using a pay-per-use pricing model. The first advantage of such a system is to improve the utilization rate of IT resources. Moreover, for business and industrial use, companies can buy the exact amount of computational power they need. Consumers can get the computing resources they need (e.g. CPU time, cloud storage, software services) in a self-service manner, anytime, anywhere, and without the need for manual interaction [1]. However, computing systems that consist of a large number of hardware and software components will eventually fail [2]. Therefore, in addition to the technical difficulties of coordination, a fault tolerance mechanism is crucial for the cloud system.


1.1 Background

When a failure occurs inside the system, we need to use fault-tolerance technologies to eliminate the impact of the fault on the system's function [3]. According to their timeliness, faults can be classified into the following three types: permanent faults, intermittent faults, and accidental faults [4]. A permanent fault lasts forever unless repaired. For hardware, a permanent fault means an irreversible physical variation; for software, this type of fault is an error state that cannot be recovered from automatically. An intermittent fault is short-lived but recurring; such faults are both accidental and irregular. Accidental faults are transient and may be non-repetitive [5]. They are often caused by environmental changes, power supply interference, fluctuations in component performance, random software changes, electromagnetic interference and other factors. This type of fault may occur only once in a long time but can result in data errors or even system failures. The use of fault-tolerant methods depends on the specific situation.

A fault-tolerant system automatically detects and diagnoses system faults and then adopts a strategy for controlling or handling the faults. According to the failure response phase of the system, there are three types of fault-tolerant schemes: fault detection, static redundancy, and dynamic redundancy. Fault detection does not provide tolerance for faults but gives a warning when a fault occurs. It is widely used in microsystems such as micro-computers and micro-controllers, which have applied lightweight on-line detection mechanisms [6]. Strictly speaking, fault detection is not fault tolerance: although it detects faults, it cannot tolerate them; it can only give fault warnings. Static redundancy is used in error correction code memory or in systems such as majority-voting redundant computers with a fixed configuration (i.e., the logical connections between the devices remain the same).

With the rapid development of computer hardware and networks, the system overhead of fault-tolerant computers is decreasing, and the speed of error correction is gradually accelerating [7]. Software-based fault tolerance does not place high requirements on the hardware; on the contrary, the system is flexible and resource utilization is reasonable. Artificial intelligence will be used in the detection and diagnosis of failures, and various intelligent tools such as expert systems will also support fault detection and diagnosis. With these, people can use expert knowledge and reasoning engines to provide diagnostic results quickly and accurately. Dynamic reconstruction of the system, fault recovery and neuron chips will be used for fault-tolerant technology and will be implemented with AI support.

At the same time, the internal self-test and self-reconfiguration of circuits can solve the reliability problem of the circuit itself and of the subsystem. There will be fault-tolerant VLSI chips and fault-tolerant design chips that directly support the system's fault-tolerant design, providing system designers with fault-tolerant design components that are transparent. Research on on-chip fault tolerance technology is now a major branch of fault tolerance research.

1.2 Problem

What kind of fault-tolerant scheme can be used in cloud computing systems to assure both the safety of running applications and the safety of the supervision system itself?

1.3 Purpose

The purpose of this project is to implement a library or framework that can be used to provide a fault-tolerant mechanism for processes in cloud systems. Such a framework should be able to provide a type of process as a supervisor that can monitor and recover other processes.

1.4 Goals

The goal of this project is to develop a software framework. It is divided into the following three sub-goals:

Sub-goal 1: Building a local supervision system. Implementing the basic functions of the supervisor mechanism: a) linking between a supervisor and a worker; b) the supervisor's ability to detect faults in, and restart, workers.

Sub-goal 2: Building a remote supervision system. Since a local machine can fail by itself, the second goal is to build a remote supervisor mechanism to assure the safety of local machines.

Sub-goal 3: Refinement and optimization. After finishing the main functions of the system, we pursue optimizations to boost the performance of the system.

1.5 Literature review

Through the literature study, we found that the basic fault-tolerance techniques in cloud services are redundancy and checkpointing.

Checkpointing strategies can be divided into disk-based and disk-less checkpointing. The disk-based approach has an obvious performance bottleneck because of its slower storage access, high checkpoint overhead and slower restart compared to the disk-less checkpoint. The main research problem of the disk-less checkpoint is the trade-off between the sampling rate and performance degradation. A basic strategy is to set a constant checkpoint frequency according to how much slack each task has, such that we can add a maximum number of checkpoints while assuring no additional performance degradation. An example of the disk-less checkpoint is multilevel disk-less checkpointing (Hakkarinen and Chen, 2013), which can recover from N simultaneous failures by checkpoint recursion [8].

Checkpoint strategies are mainly used in tightly coupled computing applications, because in such situations the computing units are highly related and a local failure can result in a global failure, which is unbearable. BlobCR (Nicolae and Cappello, 2011) is an appropriate scheme for tightly coupled scientific applications written using the Message Passing Interface, porting the checkpoint images to IaaS clouds [9]. In BlobCR, all components of the process are replicated.

In recent years, more articles about adaptive fault-tolerant techniques have come out. These schemes are based on combinations and variations of replication and checkpointing. The basic idea is to maintain and improve the safety of systems by adapting to environmental changes. For real-time tasks, the fault tolerance mechanism means that the program has several different instances, and each instance is executed by a different task scheduling algorithm (Malik et al., 2011). In this model, the system keeps several different real-time instances of the same function [10], and the final decision on execution is made based on so-called reliability. When tasks are finished on time, the reliability increases, and vice versa. If the reliability falls under a specific threshold, the VM is replaced by a new VM or recovered by a backward recovery method. The Dynamic Adaptive FT Strategy (Sun et al., 2013) searches for a mathematical relation between failure rates and the basic techniques (checkpoints and replications) [11].

1.6 Ethical issues, sustainability and social issues

1.6.1 Ethical issues

The main ethical problem concerns online privacy rights, although the legal profession has not yet formed a unified view. Some scholars believe that online privacy refers to the personal information of citizens on the Internet: the private space and the peace of one's network life are protected by law, and it is prohibited to illegally obtain, invade, spread or exploit others' private information. With the rapid development of science and technology, people's daily life is increasingly dependent on the network. The network brings great convenience to people's lives, while it also challenges the protection of privacy. The openness, virtuality, interactivity, and anonymity of the network environment make the usual privacy protection methods ineffective there. There are more and more acts of disclosure and dissemination of the privacy of others through the Internet (such as "Internet mass hunting", which causes widespread concern and directly exposes personal privacy). The reasons for these behaviours vary: some are for the public interest, some for the commercial value of online privacy driven by economic interests, and some for the self-satisfaction of the individual, and so on. Compared with ordinary privacy rights, there are many ways to infringe on privacy on the Internet, including excessive collection of personal information, illegal access to private information on the Internet, illegal use of private data on the Internet, illegal disclosure of private data and illegal transactions. Online privacy has its unique features, which can be summarised as diversification of infringement forms, diversification of infringement subjects, expansion and objectification of objects, the dual nature of infringement objects, and the intelligence, concealment, seriousness and complexity of the means of infringement. Users generally do not care much about online privacy information, caring only about ID card numbers and bank cards. In this case, violations of online privacy are even more serious.

1.6.2 Sustainability and social issues

Since the beginning of humanity, the most critical work of people has been to obtain the materials needed to sustain life, and this has not changed until today. These life-sustaining materials mainly come from agriculture, so the agricultural revolution is the basis for promoting social development.

The industrial revolution has increased the level of social productivity through the continuous invention of tools, and promoted the emergence of a new agricultural revolution, enabling people to produce more of the necessities of life. However, the industrial revolution has also brought about a severe social crisis. More and more material wealth is concentrated in the hands of fewer and fewer rich people, and most people are insecure in their living environment. The growing gap between the rich and the poor further deprives the poor of their right to live in dignity. Human greed causes the rich to plunder wealth without any concern, leading to more severe crises such as energy crises, environmental crises, and economic crises. Cloud computing can significantly improve the efficiency of resource utilisation. By redistributing the wealth of information, the gap in material wealth is narrowed, and this will lead to significant changes in people's thinking and a revolution in the underlying technology. Perhaps the essence of the cloud computing revolution is to solve the various crises brought about by the industrial revolution, to rebuild a truly harmonious and peaceful society through the new industrial revolution and the agricultural revolution, and to let plants, animals, people and the environment live in harmony and peace.

1.7 Delimitation

This project aims to propose a fault-tolerant software framework for cloud computing, so we do not consider hardware approaches.

1.8 Structure of the thesis

Chapter 1 of this report begins with a background and introduction to help readers understand and review the relevant areas, and then reviews the significance of the project, related ethical issues and related work. Chapter 2 introduces the process restoring tool CRIU, which is critical in this project, and describes how it stores the running state of a process and how to recover from it. Chapter 3 describes the principle of the remote procedure call and the TCP/IP it depends on; TCP/IP is the transport protocol for the remote supervisor recovery process. Chapter 4 is about the structure of the local supervisor and how it works, and proposes the concept of a supervision tree. Chapter 5 is about the principle and operation process of the remote supervisor. Chapter 6 shows the test cases and the measurement of its functions. Chapter 7 discusses potential improvements in the future.

Chapter 2

Process restoration using CRIU

CRIU is a tool for the Linux platform to perform checkpoint/restore functions in userspace. With this tool, we can freeze the whole running application or part of it, and save the execution status of the application on the disk as a set of image files. These image files can then be used to restore the application from the frozen point in time and let it continue to run. With this software, we can perform live migration, application snapshots, and remote debugging [12]. The most notable feature of CRIU is that it performs checkpoint/restore in userspace, without the need to modify the application or the kernel.

CRIU saves the program state through checkpoints. The checkpoint mainly depends on the /proc file system, because in a Linux system the information about running processes is all stored in /proc. The process dumper mainly performs the following tasks during the checkpoint phase.

2.1 Information gathering

CRIU can get the $pid of a process group leader using the --tree option in the terminal [12]. The dumping process then traverses /proc/$pid/task/ to gather the essential information about the threads to be frozen, and scans /proc/$pid/task/$tid/children to recursively collect the information of the children. The main steps are as follows:

a) During the information collection process:
- Collection of the process tree
- Collection of process tree ids
- Lock operation on the network
- Collection of namespace information


b) During the information backup process:
- Backup of each tree node
- Backup of the mnt_namespace namespace
- Backup of file lock information
- Backup of the process tree according to the root node
- Backup of CGroup information
- Backup of shared memory

c) Writing to the image files

2.1.1 Gathering information of process tree

This part is about the collection of process tree information using the function collect_pstree. To freeze the CGroup, the freeze operation is performed on all processes under the PID of the CGroup. Here compel_interrupt_task only interrupts the process specified by the PID, and then compel_wait_task waits for the process to return status information. After that, collect_task collects information about all threads and child processes under the process tree and freezes them. Finally, CRIU waits for the processes frozen by freeze_processes to return, and all the process tree information is collected.

a) After successfully writing the status to the status files, the system interrupts the entire process tree under the CGroup.
b) CRIU interrupts the CGroup iteratively, gets control of the specified process from the external process and performs an interrupt operation.
c) It is forced to wait for the task to resume from the interrupt, until the signal value is returned according to the status information returned by the child.
d) Collect information about all child processes and threads under the parent process based on the information from the root node and perform an interrupt-freeze operation on them.
e) Collect thread information: first, collect the thread information from the directory /proc/$pid/task, then freeze the processes sequentially according to the thread information.
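As an illustration of this traversal (not CRIU's actual code), a minimal C sketch that walks /proc/$pid/task/ and /proc/$pid/task/$tid/children could look as follows; the output format is arbitrary:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Recursively list the threads of 'pid' and descend into its children,
 * mirroring the traversal described above. */
static void collect_tree(pid_t pid, int depth)
{
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
        DIR *d = opendir(path);
        if (!d)
                return;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
                if (e->d_name[0] == '.')
                        continue;                /* skip "." and ".." */
                int tid = atoi(e->d_name);
                printf("%*spid %d, thread %d\n", depth * 2, "", (int)pid, tid);

                char cpath[96];
                snprintf(cpath, sizeof(cpath),
                         "/proc/%d/task/%d/children", (int)pid, tid);
                FILE *f = fopen(cpath, "r");
                if (!f)
                        continue;
                int child;
                while (fscanf(f, "%d", &child) == 1)
                        collect_tree((pid_t)child, depth + 1);
                fclose(f);
        }
        closedir(d);
}

int main(int argc, char **argv)
{
        if (argc > 1)
                collect_tree((pid_t)atoi(argv[1]), 0);
        return 0;
}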

2.1.2 Collection of id information of the process tree

a) Traverse each process tree node, obtain the id information of the backup task and back up the namespace for each tree node.

b) Back up the id information of sub-objects related to the task.
c) Get the root of the red-black tree, find the appropriate insertion position of the node and generate the id.
d) Back up the namespace id information; the id information of each part gets its id.
e) Generate a namespace if supported. First, determine the type of namespace if the namespace information exists. The PID of the namespace must be the same as the root node of the process tree; otherwise, it will report an error: a nested namespace cannot be backed up.

2.1.3 Operation on the network

a) First write the configuration file of the IP routing table, then switch to the specified net namespace, create a new pipe pfd and write the configuration file to pfd[1]. Then fork a new child process and set the signal mask. After that, the pipe output pfd[0] is redirected to the standard input, and the iptables-restore command is executed in the child process to set the routing table according to the configuration file, which achieves the purpose of the network lock.
b) Switch the namespace: open the specified namespace file to get the file descriptor, and then set the specified namespace according to the descriptor.
c) Restore the IP routing table: create a new pipe, write the information of the configuration file conf, fork a new child process through a system command line in the specified userspace, redirect the pipe read end pfd[0] to the standard input, and then restore the IP routing table according to the conf configuration file.

2.1.4 Information gathering of namespaces

a) Collect user namespaces for the dump.
b) Collect information from mnt_namespace: through a series of calls to the information parsing function, the parsed data is obtained from /proc/mount_info and populated into the mounted information structure, and the filled structure is appended to the mount information global list mntinfo.
c) Information collection for network namespaces.


Figure 2.1: Lock operation on the network

2.2 Backup

2.2.1 Process tree node backup

a) Infect the process to obtain a parasitic control unit: force the infection of the specified PID process and cause it to start daemon mode to accept messages received via the socket.
b) Implementation of memory file share mapping: create a memory file descriptor in the infected process and name it CRIUMFD, then map the memory file descriptor into the memory space of the infected process. At the same time, the memory file descriptor is mapped into local memory, which makes the memory of the infected process mapped with the local memory too, so that memory changes of the infected process are detectable; this means the parasitic source is successfully installed.
c) The parasitic process starts the background service mode: first bind the socket and listen to it, then start the handler that deals with exceptions of the parasitic process in the child process, and wait for messages received from the specified socket.

2.2.2 Backup steps of memory pages

a) Preparation phase: initialization of the page cache and creation of the transmission unit structure.
b) Memory record: all virtual memory blocks correspond to the ppb->iov pipe page buffer unit in the pipe object, and the uniformly managed memory pages in pp->iovs include dirty pages, holes and regular pages.
c) Pipeline transmission: the memory pages in the buffer manager pp->iovs are respectively written to the write end ppb->p[1] of each buffer pipe.
d) Package all memory pages into images.
e) Reset the dirty page bit: reset the dirty bit by writing 4 to the clear_refs file.

2.3 Cleanup and restore

a) Cleanup: as all the items described in the last two sections have been dumped, CRIU uses the ptrace tool to cure the dumpee by removing all parasite code and restoring the original code. CRIU then detaches, and the dumpee continues to run.
b) Resolve shared resources: after cleaning up, CRIU analyzes the image files to obtain the set of sharing relationships between processes and resources. The shared resources are then re-collected, and all other resources are inherited or otherwise acquired in the second phase.
c) Fork the process tree and restore basic task resources: in this step, CRIU calls the function fork() multiple times to recreate the processes dumped in the steps above. Then CRIU restores all resources except the exact location of memory mappings, timers, credentials, and threads; the recovery of these resources is postponed. At this stage, CRIU opens the related files, prepares three kinds of namespaces, maps the private memory areas and fills them with data, creates the dumped sockets, calls chdir() and chroot(), and performs some other operations.

Chapter 3

Remote Procedure Call

In this thesis work, the remote procedure call is used for remotely calling the restoring procedure, so that when a remote supervisor detects a failure of the local machine, it can recover it using RPC.

3.1 TCP/IP

The Internet Protocol suite contains hundreds of protocol standards, but the two most important protocols are TCP and IP, so the Internet protocol suite is often referred to as TCP/IP. It is the suite of protocols used on the Internet and is nowadays used in most home and business networks [13]. When communicating, both parties must know each other's address, just like when sending an email one must know the other's email address. The unique identifier for each computer on the Internet is the IP address, such as 123.123.123.123. If a computer accesses two or more networks at the same time, such as a router, it will have two or more IP addresses. Therefore, an IP address corresponds to a network interface of the computer.

The IP protocol is for sending data from one computer to another via the network. The data is divided into small pieces and then sent out as IP packets. Due to the complexity of Internet links, there are often multiple lines between two computers; therefore, routers are responsible for deciding how to forward an IP packet. IP packets are sent in blocks and may take multiple routes, but neither delivery nor the order of arrival is guaranteed.

An IPv4 address is a 32-bit integer. However, it is often represented as a string, such as 192.168.0.1, which is a representation of the 32-bit integer grouped by 8 bits for easy reading.


From the composition of the IP address, we can see that the IP address space is limited, so it will eventually be exhausted. As IPv4 address resources become increasingly scarce, computer network experts have proposed the IPv6 protocol, hoping to alleviate this problem by expanding the number of address bits. An IPv6 address is a 128-bit integer, an expanded version of the IPv4 address currently in use, represented by a string similar to 2001:0db8:85a3:0042:1000:8a2e:0370:7334.

The TCP protocol is built on top of the IP protocol. The TCP protocol establishes a reliable connection between two computers to ensure that data packets arrive in order. The TCP protocol establishes a connection by a set of handshakes and then numbers each IP packet to ensure the other party receives them in order; if a packet is lost, it is automatically resent. Many commonly used higher-level protocols are based on the TCP protocol, such as the HTTP protocol for browsers and the SMTP protocol for sending mail. In addition to the data to be transmitted, a TCP packet contains the source and destination IP addresses and the source and destination ports.

3.2 Principle and operation process of RPC

RPC is the abbreviation of Remote Procedure Call. Birrell and Nelson's paper published in 'ACM Transactions on Computer Systems' in 1984 gave a classic interpretation of RPC. RPC refers to a procedure on the caller computer calling another procedure on the callee computer. The calling procedure on the caller is suspended, and the called procedure on the callee computer starts executing. When the value is returned to the caller, the calling procedure resumes [14]. The caller can pass information to the callee through parameters, and information can then be obtained from the returned result. What's more, this process is transparent to developers.

The remote procedure call frame uses the client/server (C/S) mode. The C/S mode is known as the client and server architecture. The client-server model is designed to facilitate the sharing of information between the two ends of the communication. It allows a large number of users to simultaneously access information from the database [15]. It is a software system architecture through which one can take full advantage of the hardware environment at both ends and distribute tasks between client and server to reduce communication overhead. In this case, the requester is a client, while the service provider acts as a server.

Figure 3.1: Remote procedure call process

When the request is received by the server, the operating system on the server passes it to a so-called server stub. The server stub corresponds to a client stub on the server side and is a block of code that translates incoming requests into local procedure calls. Typically, a server stub calls the function 'receive' first, then blocks itself and waits for a message to arrive. After receiving the message, the server parses the parameters from the received message and invokes the corresponding procedure on the server in a normal way. From the server's perspective, the procedure seems to be called directly by the client: all the function call information is on the stack. The server executes the required functions and then returns the results in a conventional manner. In the end, control returns to the server stub, which packages the result (buffer) into a message and then calls the function 'send' to return the result to the client. After that, the server stub calls the function 'receive' again to

get ready for the next input request. After the client machine has received the message, the operating system on the client is informed that the message belongs to a client procedure (the process is actually a client stub, but the operating system is not able to distinguish between the two). The operating system copies the message into the appropriate cache and then unblocks the client process. The client stub checks the message, extracts the return value and copies it to the caller, and then returns it in the usual way. When the caller regains control after the call has finished, all it sees is that it now has the required data; whether the operation was done on the local operating system or remotely is invisible to it. Throughout the method, the client can ignore content that is not of interest. The operation performed by the client is simply a regular (local) procedure call to access the remote service; it does not need to call the function 'send' or the function 'receive' directly. The details of the message passing are encapsulated in the library procedures of both sides, just as a traditional library hides the details of executing the actual system call.

The advantage of RPC is that the interaction mode is simple and easy to use, because the service is provided as an interface, and the interaction protocol between client and server is easy to unify. Many mature companies maintain their own RPC frameworks, such as Baidu's sofa-pbRPC and Google's gRPC. Most companies can use an RPC framework to generate all interface packing and unpacking code; users only need to implement the functions. Using an RPC framework is simple: only a proto file that describes the protocol interaction on both sides is needed, because the description file (proto file) is enough to keep both sides consistent. RPC is also very convenient to test. Most RPC frameworks are cross-language, so we can write test programs in a more convenient scripting language (such as Python) to simulate interaction with C/C++ programs.

3.3 Generate RPC frame

Rpcgen is a compiler that allows people to easily write RPC programs and automatically generates interface code for network connections, eliminating the hassle of handwriting this code. So it can be regarded as an automatic code generation tool for RPC. The basic steps of generating an RPC framework using rpcgen are as follows:

a) Run the following commands in a terminal:
» mkdir rpcroutine
» cd rpcroutine
» vi rpc.x

b) Type the following code in rpc.x:
program RPCFRAME {
    version VERSION {
        string RPCTEST(string) = 1;
    } = 1;
} = 12345678;

c) Generate code using rpcgen:
» rpcgen rpc.x
and get the following files: rpc_clnt.c rpc.h rpc_svc.c

d) Generate rpc_clnt_func.c:
» rpcgen -Sc -o rpc_clnt_func.c rpc.x

e) Generate rpc_srv_func.c:
» rpcgen -Ss -o rpc_srv_func.c rpc.x

f) Compile the server code:
» gcc -Wall -o rpc_server rpc_srv_func.c rpc_svc.c

g) Compile the client code:
» gcc -Wall -o rpc_client rpc_clnt_func.c rpc_clnt.c

h) Start the server:
» ./rpc_server

i) Start the client:
» ./rpc_client 127.0.0.1
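As an illustration, the generated skeletons might be filled in as sketched below. The function names rpctest_1_svc and rpctest_1 and the constants RPCFRAME and VERSION follow the usual rpcgen naming for the interface above; the actual restore logic is only hinted at by a placeholder reply.

/* rpc_srv_func.c (sketch): the service routine filled in */
#include <stdlib.h>
#include <string.h>
#include "rpc.h"                      /* generated by rpcgen from rpc.x */

char **
rpctest_1_svc(char **argp, struct svc_req *rqstp)
{
        static char *result;          /* static: the stub reads it after return */

        /* *argp carries the string sent by the client, e.g. a path to the
         * image files; a real implementation would run the restore here. */
        free(result);
        result = strdup("restore triggered");
        return &result;
}

/* rpc_clnt_func.c (sketch): calling the remote function from the client */
#include <stdio.h>
#include "rpc.h"

int call_remote(char *host, char *arg)
{
        CLIENT *clnt = clnt_create(host, RPCFRAME, VERSION, "tcp");
        if (clnt == NULL)
                return -1;
        char **res = rpctest_1(&arg, clnt);   /* blocks until the server replies */
        if (res != NULL && *res != NULL)
                printf("server replied: %s\n", *res);
        clnt_destroy(clnt);
        return res ? 0 : -1;
}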

Chapter 4

Local Supervision Tree

In this supervision system, we have two types of nodes: supervisor and worker. The supervisor is a node responsible for starting, cancelling, and monitoring the running state of its child. The worker is a type of node that runs a process that does the actual work. The supervisor is designed to deal with failures of workers, but a supervisor itself can fail too. So, we propose a so-called local supervision tree system to ensure each supervisor is supervised.

4.1 Nodes and relations


Figure 4.1: An example of supervision tree

Fig 4.1 shows a typical supervision tree. In this figure, S is the abbreviation of Supervisor, and W is for Worker. In a supervision tree, only supervisors can be father nodes, and the number of children a supervisor can have is unlimited.


Leaf nodes can either be supervisors or workers (although a leaf supervisor is meaningless).

4.2 Creation of new child processes

Through the system call 'fork', we can create a new process with the same running stack as the current process. The child process inherits the entire address space of the parent process, including the process context, stack address, memory information and process control block (PCB). We usually refer to the new process as the child process and the current process as the parent process. According to the Linux C process creation mechanism, the father node first copies itself with the fork function and then uses the execl function to cover this copy and execute the target script (see Figure 4.2; a minimal sketch follows the figure).


Figure 4.2: Child creating process.
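A minimal sketch of this create-and-cover pattern is shown below; the script path is a placeholder:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a child that replaces itself with the worker script. */
pid_t start_child(const char *script)
{
        pid_t pid = fork();            /* duplicate the calling process     */
        if (pid == 0) {
                /* child: overwrite the copy with the target program */
                execl(script, script, (char *)NULL);
                perror("execl");       /* only reached if execl fails       */
                _exit(1);
        }
        return pid;                    /* parent: remember the child's pid  */
}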

4.3 Watchdog timer mechanism

The watchdog is a counter that can be reset within a certain period of time. When the watchdog starts, the counter starts counting automatically. If the counter has not been reset after a certain period of time, it sends a reset signal when it reaches the specified value.

Such a mechanism can detect whether the process has exited abnormally or is stuck in an infinite loop. Many devices, including the CPU, receive this signal and reset and restart. In order to ensure that the watchdog does not produce reset signals, the watchdog counter needs to be cleared within the time interval allowed by the watchdog, so that the counter starts counting again. If the system works normally and is guaranteed to "feed the dog" on time, then all is fine. Once the program fails, there is no "dog feeding", and the system is "bitten" and reset.


Figure 4.3: Procedure of running watchdog timer

In this project, the heartbeat process works as a watchdog resetter, which periodically sends reset signals to its supervisor (if any). The inter-process communication method used here is a semaphore, which is created after finishing setting up the child node. First, the father node starts a watchdog process with the process id of the child. Then it creates a semaphore, using the child's process id as the id of the semaphore. This method of id assignment makes it easier for the father and child to communicate. Figure 4.3 shows the mechanism of the watchdog timer and the procedure of watchdog reset/overflow: the father creates a semaphore and periodically tries to lock it, while the child periodically posts the semaphore. Once the child is dead or stuck in a loop, it is no longer able to send this reset signal; thus the father fails when trying to wait for the semaphore, and an overflow occurs. A minimal sketch of this handshake is given below.
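The following sketch illustrates the semaphore handshake, assuming POSIX named semaphores whose name is derived from the child's pid; the names and the five-second period are illustrative, not the project's exact values:

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Supervisor side: wait for the child's heartbeat, with a timeout. */
int watch_child(pid_t child)
{
        char name[32];
        snprintf(name, sizeof(name), "/wdt_%d", (int)child);
        sem_t *sem = sem_open(name, O_CREAT, 0600, 0);

        for (;;) {
                struct timespec ts;
                clock_gettime(CLOCK_REALTIME, &ts);
                ts.tv_sec += 5;                   /* watchdog period: 5 s */
                if (sem_timedwait(sem, &ts) != 0)
                        return -1;                /* no heartbeat: WDT overflow */
        }
}

/* Child side: called periodically from the working loop. */
void heartbeat(pid_t self)
{
        char name[32];
        snprintf(name, sizeof(name), "/wdt_%d", (int)self);
        sem_t *sem = sem_open(name, 0);
        if (sem != SEM_FAILED)
                sem_post(sem);                    /* reset the watchdog */
}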

4.4 Thread group and restart strategies

The previous section describes a primary coarse-grained watchdog monitoring method. In practice, a process usually consists of a number of threads, but the importance of these threads often differs. When some of the non-critical threads fail or get stuck in an infinite loop, we want to ignore these errors and let the process continue to run. This requires setting watchdogs for different threads in the same process to detect the exact running state of the process. The implementation goes like this: when a new worker process is created, we no longer create a single watchdog for it, but create a monitor process to manage the watchdogs. The monitor thread communicates with the child process through a message queue. Once a new thread is created in the child process, the child process sends the information that is needed to run a watchdog timer. A single message includes the group number (importance) of the thread and the thread number (specified by the user); this information is used to derive the id of the watchdog for each thread. A sketch of such a registration message follows.
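A hypothetical shape of such a registration message, assuming System V message queues keyed per child, might look as follows (field and constant names are illustrative):

#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Registration message a worker thread sends to the monitor so that a
 * per-thread watchdog can be created for it. */
struct wdt_msg {
        long mtype;          /* required by System V message queues      */
        int  group;          /* CRITICAL or NON-CRITICAL thread group    */
        int  thread_no;      /* user-assigned thread number              */
};

enum { GROUP_CRITICAL = 1, GROUP_NON_CRITICAL = 2 };

/* Worker side: announce a new thread to the monitor. */
int register_thread(key_t key, int group, int thread_no)
{
        int q = msgget(key, 0666 | IPC_CREAT);
        if (q < 0)
                return -1;
        struct wdt_msg m = { .mtype = 1, .group = group, .thread_no = thread_no };
        return msgsnd(q, &m, sizeof(m) - sizeof(long), 0);
}

/* Monitor side: receive registrations and start watchdogs for them. */
int next_registration(key_t key, struct wdt_msg *out)
{
        int q = msgget(key, 0666 | IPC_CREAT);
        if (q < 0)
                return -1;
        return (int)msgrcv(q, out, sizeof(*out) - sizeof(long), 1, 0);
}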


Figure 4.4: Multi-watchdog mechanism

4.5 Internal structure of supervisor and child

Except for the heartbeat part, a supervisor consists of data, functions, threads and interfaces to its children. Figure 4.5 shows the structure of the supervisor and worker.


Figure 4.5: Internal structure of supervisor and worker

The core part of a supervisor is the information about its child processes. Because this information must be created and deleted dynamically, we implement the data part as a unidirectional linked list (a sketch of a list node is given below). Each node of the list stores the information of one child: the process type (supervisor or worker), the script to be executed, the execution state, the process id of the watchdog timer and the maximum number of restarts. The structure of a child is simple: a heartbeat process and a working process that does the actual work. The function part is for manipulating the child list. To begin with, the supervisor adds a new child node to the child list, allocates memory and initializes it. Secondly, a new child process is created by forking, and it starts to send heartbeats. Then, the supervisor starts to accept the heartbeats and monitors whether the child process is running correctly. If it is not, the supervisor uses CRIU to restore the child if it is a supervisor, or restarts it if it is a worker.
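A hypothetical sketch of one node of this child list is shown below; the field names are illustrative, not the project's exact definitions:

#include <sys/types.h>

enum child_type { CHILD_SUPERVISOR, CHILD_WORKER };

/* One entry of the supervisor's unidirectional child list. */
struct child_info {
        enum child_type type;        /* supervisor or worker              */
        char script[256];            /* script/program the child executes */
        int  state;                  /* current execution state           */
        pid_t pid;                   /* pid of the child process          */
        pid_t wdt_pid;               /* pid of its watchdog timer process */
        int  max_restarts;           /* give up after this many restarts  */
        struct child_info *next;     /* next node in the linked list      */
};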

4.6 Program execution flow

Fig 4.6 shows a sequence diagram describing the procedure from start to child restoration. Each vertical lane represents an independent process, and different branches within a lane represent different threads. Fork() means creating a new process. To begin with, the main script starts as a panel process, then forks a root supervisor. The panel process communicates with this root supervisor and any other supervisors via message queues. As the root supervisor starts, it creates a receiver thread to accept instructions from any panel that is connected to it by its process id. Each time the supervisor finishes starting a new child process, it starts a watchdog timer to monitor it. A child process can be another supervisor or a worker. The child process periodically sends WDT reset signals to prove its liveness. If the heartbeat is interrupted, the father automatically recognises this as a failure and terminates the child; then, if the child is a worker, it starts a new instance of it, or if the child is a supervisor, it restores it.

Figure 4.6: Running procedure from start to child restoration

4.7 Local supervision tree and supervision forest

Since supervisors could fail, a supervisor needs another supervisor to monitor it. Thus, we can set up supervision trees to realise it.


Figure 4.7: An example of supervision forest

However, such a supervision tree cannot bear the failure of its root. So, we organise a set of supervision trees and let their roots be supervisors of each other. Fig 4.7 represents a supervision forest in which the root supervisors are organised as a loop. In this loop, only when all of the root supervisors have failed does the local machine become unable to restore itself. Such a mechanism ensures that each of the nodes has its own supervisor.

Chapter 5

Remote Supervisor

In the local supervision tree chapter, we explored the safety of the local processes and the safety of the supervisor processes, but in actual operation, the local machine may also fail. So, in this chapter, we propose a remote monitoring process to meet this requirement.

5.1 Reasons for using RPC

When using a remote procedure call, the information format is transparent. In a native application, to call an object, we need to pass parameters and receive a call result. The caller does not need to care how the parameters are used inside the called object and how the results are returned. For remote calls, these parameters are passed to another computer on the network in some form of message, and the caller does not need to care how this information is structured. There should also be cross-language capability, because the caller does not actually know which language the remote server application is written in. So for the caller, the call should succeed regardless of the language used by the server, and return values should be described in a form understandable by the calling program's language.

5.2 Execution procedure of the remote supervisor

This mechanism is mainly for root supervisors, so the supervisor should be started using the '-r' option. When the root supervisor is successfully initialized, it starts a dumper to periodically checkpoint the root supervisor process and


generate the image files which are needed for the restoration. In CRIU, a dump operation cannot be done if the dumping process itself is within the dumped tree. So to start a dumper, we first fork a new child process, then fork a grandchild process and let the main process of the child exit. By doing this, the dumper process has no direct kinship with its grandparent process, and it can therefore dump its grandparent. A minimal sketch of this pattern is given below.
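A minimal sketch of this double-fork pattern, assuming CRIU is invoked through its command-line interface, could look as follows; the image directory, period and options are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Periodically dump 'target' (e.g. the root supervisor) from a process
 * that is outside the dumped tree: fork a child, let it fork the real
 * dumper and exit, so the dumper is re-parented away from the tree. */
void start_dumper(pid_t target, const char *img_dir, int period_s)
{
        pid_t child = fork();
        if (child != 0) {
                waitpid(child, NULL, 0);   /* reap the intermediate child  */
                return;                    /* supervisor continues its work */
        }

        if (fork() != 0)
                _exit(0);                  /* intermediate child exits at once */

        /* grandchild: no longer a descendant of 'target', so it may dump it */
        for (;;) {
                char cmd[256];
                snprintf(cmd, sizeof(cmd),
                         "criu dump -t %d -D %s --shell-job --leave-running",
                         (int)target, img_dir);
                if (system(cmd) != 0)
                        fprintf(stderr, "criu dump failed\n");
                sleep(period_s);
        }
}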


Figure 5.1: Remote supervisor mechanism

The difference in the root supervisor (compared to child supervisors) is that it possesses a file client thread. Through this client, the root supervisor connects itself to the file server via TCP/IP. The file client thread periodically sends a copy of the compressed image files of the root supervisor (a sketch of such an upload is given below). Each time the server receives an image file, it resets the watchdog timer for the client. Once the watchdog timer overflows, the RPC client that works with the file server starts a request to call the pre-defined function in the RPC server: download the image files from the file server, unzip them and use them to restore the supervisor.
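A minimal sketch of the file client side, assuming a plain TCP connection and a pre-packed image archive (port, paths and framing are illustrative; the real system also compresses the images before sending):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one image archive to the file server; each successful upload
 * doubles as a watchdog reset on the server side. */
int send_image(const char *server_ip, int port, const char *archive)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(port) };
        inet_pton(AF_INET, server_ip, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(fd);
                return -1;
        }

        FILE *f = fopen(archive, "rb");
        if (f) {
                char buf[4096];
                size_t n;
                while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                        if (write(fd, buf, n) < 0)   /* stream to the server */
                                break;
                fclose(f);
        }
        close(fd);                    /* closing marks the end of this upload */
        return 0;
}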

5.3 Optimization

The design details described above show the final design of the system. In this section, we describe the original design and compare it to the current design.


Figure 5.2: Comparison between before and after optimization

The original design is more intuitive but also more complex. Fig 5.2 shows a simplified version of the execution sequence before and after optimization (in order to simplify the representation and highlight the changes, the fact that the file transfer is performed by a separate file server and file client is omitted).

Before optimization:
a) The local supervisor initiates an RPC, targeting the service function in the RPC server.
b) In the first call of the service function, the RPC server starts a watchdog thread to monitor the RPC client. Also, each time the service function is called, the function resets the watchdog timer.
c) The RPC server starts a request to get the image file of the supervisor that is to be monitored.
d) The RPC client sends the image file.
e) If the service function in the RPC server is not called for a preset time interval, a watchdog overflow occurs.
f) The RPC server puts the most recent image file back to the RPC client.
g) The RPC client restores the supervisor.

After optimization: instead of requesting files after resetting the watchdog, the watchdog timer is reset directly when an image file is sent to the server, and the restore function is called over RPC when the watchdog timer overflows.

Chapter 6

Tests and measurements

6.1 Test cases

6.1.1 Testing the local supervision tree


Figure 6.1: Test case 1 - local supervision tree

The setup steps of the test case are as follows: a) Start a root supervisor using the command './sup -r' (the option '-r' denotes a root supervisor). b) Start three child nodes: one child supervisor A and two child workers. c) Start two grandchild nodes under A. The structure of the test case is shown in Fig 6.1.


In this test, termination of any of the child nodes resulted in a successful restart or restoration.

6.1.2 Testing remote supervisor

It takes several steps to establish a remote supervision relation: a) Start the supervisor on the local machine. The script starts a dumper for the main process, an RPC server to receive restoring commands, and a file client to periodically send image files to the file server. b) Start the server, including a file server to receive and store the image files sent from the node being supervised and an RPC client to call the restore function on the local machine.


Figure 6.2: Test case 2 - remote supervisor

In this test, when all the set-up was done, the server began to receive image files and reset the watchdog timer periodically. When we intentionally terminated the local supervisor, the remote supervisor successfully called the restoration function on the local machine, and the local supervisor continued to run.

6.2 Performance measurements

6.2.1 Time required to start a local supervisor

In this section, we present the measurement results of the execution time needed to start a supervisor. In the measurement, we used time.h from the C standard library to obtain the timestamps. Since the scheduling of the operating system changes over time, the running time required for the same program differs between runs. Therefore, in this measurement, we measured the run time of each component ten times at different times and averaged the results.

         supervisor  init variables  para check  fork panel  sup init
1          0.129        0.028           0.001       0.036      0.075
2          0.122        0.022           0.001       0.031      0.073
3          0.137        0.019           0.001       0.033      0.067
4          0.109        0.025           0.002       0.029      0.065
5          0.118        0.022           0.001       0.028      0.069
6          0.119        0.023           0.001       0.042      0.077
7          0.147        0.021           0.002       0.038      0.074
8          0.154        0.019           0.001       0.031      0.068
9          0.121        0.025           0.001       0.032      0.071
10         0.135        0.029           0.001       0.032      0.072
Average    0.1291       0.0233          0.0012      0.0332     0.0711

Table 6.1: Running time of starting a new supervisor and its components (ms)

We can tell from Table 6.1 that the item which takes the most time is supervisor initialization, followed by forking the control panel and initializing variables. The reason is that in the function supervisor_init, we start a receiver thread for the supervisor so that it can receive commands such as starting/terminating child nodes, and we initialize a linked list and use the function malloc() to allocate memory for the list nodes. These operations, which require system calls, are more time-consuming. Forking a control panel is time-consuming as expected, while initializing variables is not, because it only involves declaring large buffers and getting the process id of the process itself.

         supervisor  worker
1          0.375      0.332
2          0.357      0.352
3          0.346      0.364
4          0.348      0.329
5          0.362      0.381
6          0.335      0.366
7          0.434      0.392
8          0.367      0.321
9          0.356      0.362
10         0.329      0.379
Average    0.3609     0.3578

Table 6.2: Time required to start nodes (ms)

6.2.2 Time required to start a local child node

Table 6.2 shows that, although a supervisor node is much more complicated than a worker node, the start time needed is approximately the same, so the time consumption is basically independent of the node type. In both cases the cost is dominated by the system calls used to allocate a new record node, insert it into the linked list and fork a new process.

6.2.3 Time required for dumping, restoring and restarting

In this section, we present the execution times of dumping, restoring and restarting. Since the CRIU commands cannot be used if the CRIU process itself is within the dumped tree [12], we take the following steps to create a new process: first use the function fork() to create a son process; then let the son process fork a grandson process to run the CRIU command; the last step is to terminate the son process. By doing so, the CRIU process is no longer within the process tree, because it is adopted by the init process.

operation  dumping  restoring  restarting
1           0.962      2          0.361
2           1.336      3          0.332
3           1.211      2          0.434
4           0.899      2          0.367
5           1.102      2          0.348
6           1.111      3          0.353
7           1.034      3          0.335
8           0.849      2          0.362
9           1.298      2          0.356
10          0.992      2          0.329
Average     1.0794     2.3        0.3577

Table 6.3: Time required to dump, restore and restart(ms)

6.2.4 Analysis of the time required for remote supervising

The time needed to restore a remotely supervised node consists of two parts: the RTT (round trip time) of the RPC and the restarting time of the target supervisor.

The RTT is determined by three parts: the propagation time of the link, the processing time of the terminal systems, and the queuing and processing time in the router caches [16]. The restarting time is the same as starting a local supervisor. So the RPC restore time is t_r = RTT/2 + time of restoring the local supervisor. Here we measure the RTT by using the Timestamp option of TCP, which can be used to measure the RTT accurately: RTT = current time - the echo time of the Timestamp option in the packet, where the echo time is the time the packet was sent. By measuring the reception time (current time) and transmission time (echo time) of the data packet, we get a measurement of the RTT. The results are shown in Table 6.4.

Items    RTT (ms)  restoring (ms)  RPC restore (ms)
0           20          2               12
1           25          3               15.5
2           23          2               13.5
3           15          2               9.5
4           24          2               14
5           22          3               14
6           19          3               12.5
7           30          2               17
8           24          2               14
9           24          2               14
Average     22.6        2.3             13.6

Table 6.4: RPC restore time (ms)

Chapter 7

Future works

This project successfully built a basic supervision system both locally and remotely. Through several tests, we can see that the main functions of the system work normally under the designed test cases; however, there are still some potential defects. One main problem is that the system does not support concurrent access: if we deploy the system in a case of frequent occurrences of errors, this monitoring system may collapse itself due to inconsistent or corrupted data caused by concurrent access. A possible solution is to use a finer-grained set of mutex locks to protect the data. For example, when doing an insert operation, we only lock the head node, and when a delete operation is being performed, the node to be deleted and its previous node are locked (a sketch of this idea is given below). The main data structure used is a linked list; in the supervisor process, this can be a performance bottleneck. For a larger scale of problems, using a hash table can be a good option. Another issue is that the overhead of checkpoints is still considerable, and checkpoints do not fully guarantee the safety of the system. For example, if the program already has a problem when the last checkpoint is collected (which is likely), then rolling back the running state does not solve the problem. So we need a more robust rollback mechanism: for example, if the program still fails after a rollback, continue to try earlier checkpoints (although this will increase the overhead of the system).
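As an illustration of the fine-grained locking idea (a suggestion, not code from the project), per-node mutexes with hand-over-hand locking could look roughly as follows:

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct child_node {
        pthread_mutex_t lock;        /* one mutex per list node           */
        pid_t pid;
        struct child_node *next;
};

struct child_list {
        pthread_mutex_t head_lock;   /* protects only the head pointer    */
        struct child_node *head;
};

/* Insert at the head: only the head lock is taken. */
void list_insert(struct child_list *l, pid_t pid)
{
        struct child_node *n = calloc(1, sizeof(*n));
        pthread_mutex_init(&n->lock, NULL);
        n->pid = pid;
        pthread_mutex_lock(&l->head_lock);
        n->next = l->head;
        l->head = n;
        pthread_mutex_unlock(&l->head_lock);
}

/* Delete a node: lock the predecessor and the node, not the whole list. */
int list_delete(struct child_list *l, pid_t pid)
{
        pthread_mutex_lock(&l->head_lock);
        struct child_node *cur = l->head;
        if (!cur) {
                pthread_mutex_unlock(&l->head_lock);
                return -1;
        }
        pthread_mutex_lock(&cur->lock);
        if (cur->pid == pid) {                     /* deleting the head node */
                l->head = cur->next;
                pthread_mutex_unlock(&l->head_lock);
                pthread_mutex_unlock(&cur->lock);
                free(cur);
                return 0;
        }
        pthread_mutex_unlock(&l->head_lock);

        struct child_node *prev = cur;
        cur = cur->next;
        while (cur) {
                pthread_mutex_lock(&cur->lock);    /* take next lock before releasing prev */
                if (cur->pid == pid) {
                        prev->next = cur->next;    /* unlink while holding prev and cur */
                        pthread_mutex_unlock(&cur->lock);
                        pthread_mutex_unlock(&prev->lock);
                        free(cur);
                        return 0;
                }
                pthread_mutex_unlock(&prev->lock);
                prev = cur;
                cur = cur->next;
        }
        pthread_mutex_unlock(&prev->lock);
        return -1;
}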

Bibliography

[1] Tharam S. Dillon, Chen Wu, and Elizabeth Chang. “Cloud Computing: Issues and Challenges”. In: 2010 24th IEEE International Conference on Advanced Information Networking and Applications (2010), pp. 27–33.
[2] Flaviu Cristian. “Understanding Fault-Tolerant Distributed Systems”. In: Commun. ACM 34 (1991), pp. 56–78.
[3] X. Xiao-dong. “Research on Multi-thread Parallel Computing Fault-Tolerant Technology”. In: 2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC). 2018, pp. 1384–1387.
[4] Roozbeh Bakhshi, Surya Tej Kunche, and Michael G. Pecht. “Intermittent Failures in Hardware and Software”. In: 2014.
[5] Aakriti Gupta and Shreta Sharma. “Software Maintenance: Challenges and Issues”. In: Issues 1.1 (2015), pp. 23–25.
[6] E. Dwiggins David. “Fault tolerant microcontroller for the configurable Fault Tolerant Processor”. In: (2008).
[7] Trio Adiono, Syifaul Fuada, and Rosmianto Aji Saputro. “Rapid Development of System-on-Chip (SoC) for Network-Enabled Visible Light Communications”. In: International Journal of Recent Contributions from Engineering, Science, and IT (iJES) 6 (Feb. 2018).
[8] Douglas Hakkarinen and Zizhong Chen. “Multilevel Diskless Checkpointing”. In: IEEE Transactions on Computers 62 (2013), pp. 772–783.
[9] Bogdan Nicolae and Franck Cappello. “BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots”. In: Nov. 2011, p. 34.


[10] Sheheryar Malik and Fabrice Huet. “Adaptive Fault Tolerance in Real Time Cloud Computing”. In: 2011 IEEE World Congress on Services (2011), pp. 280–287.
[11] Dawei Sun et al. “Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments”. In: The Journal of Supercomputing 66 (2013), pp. 193–228.
[12] CRIU:About. https://criu.org/CRIU:About/. Accessed July 4, 2019.
[13] Quentin Docter and Jon Buhagiar. “Introduction to TCP/IP”. In: (Apr. 2019), pp. 363–402.
[14] Andrew Birrell and Bruce Jay Nelson. “Implementing Remote Procedure Calls”. In: ACM Trans. Comput. Syst. 2 (1984), pp. 39–59.
[15] Shakirat Sulyman. “Client-Server Model”. In: IOSR Journal of Computer Engineering 16 (Jan. 2014), pp. 57–71.
[16] Jing Wu et al. “A New Sustainable Interchain Design on Transport Layer for Blockchain”. In: Smart Blockchain. Ed. by Meikang Qiu. Cham: Springer International Publishing, 2018, pp. 12–21. isbn: 978-3-030-05764-0.

TRITA-EECS-EX-2019:654

www.kth.se