Maximizing Error Injection Realism for Chaos Engineering with System Calls
Total Page:16
File Type:pdf, Size:1020Kb
Maximizing Error Injection Realism for Chaos Engineering with System Calls Long Zhang1, Brice Morin2, Benoit Baudry1, and Martin Monperrus1 1KTH Royal Institute of Technology, Sweden 2Tellu, Norway Abstract—In this paper, we present a novel fault injection number of invocations to system calls that naturally fail in framework for system call invocation errors, called PHOEBE. production. We propose a novel fault injection framework PHOEBE is unique as follows; First, PHOEBE enables developers called PHOEBE, for doing realistic error injection at the system to have full observability of system call invocations. Second, HOEBE PHOEBE generates error models that are realistic in the sense call level. To define realistic errors, P first observes that they mimic errors that naturally happen in production. the natural system call invocation errors which happen in a Third, PHOEBE is able to automatically conduct experiments to production system. Then it analyzes those previously observed systematically assess the reliability of applications with respect errors to synthesize a series of realistic error injection models to system call invocation errors in production. We evaluate the that systematically amplify natural errors. The injection of effectiveness and runtime overhead of PHOEBE on two real-world applications in a production environment for a single software such realistic errors brings valuable insights into the error stack: Java. The results show that PHOEBE successfully generates handling capabilities of an application with respect to realistic realistic error models and is able to detect important reliability system call invocation errors. To our knowledge, the synthesis weaknesses with respect to system call invocation errors. To our of those realistic error models is the key novelty of PHOEBE. knowledge, this novel concept of “realistic error injection”, which We evaluate PHOEBE with two real-world applications: consists of grounding fault injection on production errors, has never been studied before. Hedwig, an email server that uses the SMTP and IMAP protocols, and TTorrent, a file downloading client based on the Index Terms—fault injection, system call, chaos engineering BitTorrent protocol. During the experiments in a production environment, PHOEBE observes that 84 million invocations to I. INTRODUCTION 23 unique system calls naturally fail. Based on these errors, In cloud-based software systems, one cannot fully control PHOEBE synthesizes 32 and 33 realistic error injection models the production execution environment and many unexpected for HedWig and TTorrent respectively. The generated error events keep happening: hardware issues, network fluctuations, models are then used to perform a series of chaos engineering and unanticipated user behavior [36], [61]. In order to assess experiments, which reveal important reliability shortcomings and improve the reliability of software systems in such a in both applications. The results of these experiments demon- changing and imperfect environment, different kinds of tech- strate the feasibility, applicability and added value of PHOEBE niques are being researched, in particular fault injection [45], for analyzing reliability against system call invocation errors. [58]. Fault injection evaluates software reliability by actively To sum up, our contributions are the following. injecting errors into the software system under study [34], • The concept of synthesizing realistic error injection mod- arXiv:2006.04444v7 [cs.SE] 2 Apr 2021 [64]. A recent trend in fault injection consists of injecting els for system calls, based on the amplification of errors faults in production directly [3], [8], [63], this is known in the that are naturally observed in production. industry as “chaos engineering”. In this paper, we use “chaos • PHOEBE, a fault injection framework that implements engineering” for referring to fault injection in production. this concept. PHOEBE monitors a production system, It is known that the space of all possible error injection is generates realistic error models, conducts chaos engineer- large [4]. In other words, it is potentially intractable to explore ing fault injection experiments, and outputs a reliability all possible error scenarios. When doing fault injection in assessment with respect to system call invocation errors. production, one does not want to impact users with unrealistic PHOEBE is publicly available for future research in this errors. In this paper, we address the problem of defining a area1. tractable error injection space, by exclusively focusing on • An empirical evaluation of PHOEBE with two real-world realistic errors. Our novel definition of realistic errors is that applications, an email server and a file downloading the injected errors resemble the ones that naturally happen in client, under a production workload. The results show that production. By injecting realistic errors, we identify reliability PHOEBE can inject realistic errors at runtime, in produc- issues that are more relevant for developers. tion, with low overhead, in order to detect error handling In this paper, we realize this idea in the realm of system call weaknesses with respect to system call invocation errors. errors. This focus is motivated by the essential role of system calls to analyze software behavior [21], and by the significant 1https://github.com/KTH/royal-chaos/tree/master/phoebe 1 The rest of the paper is organized as follows: Section II resources such as hardware devices and process scheduling are introduces the background. Section III and Section IV present usually managed by the kernel. When a user application needs the design and evaluation of PHOEBE. Section V discusses the to interact with a given critical resource, the corresponding runtime overhead of PHOEBE and the threats to the validity of system call is invoked. For example, in Linux, the open this research work. Section VI presents the related work, and system call is invoked when an application needs to open Section VII summarizes the paper. or create a file. Linux defines and implements more than 300 unique system calls, as well as over 100 error codes to II. BACKGROUND precisely report on errors upon invocation of those system calls A. Observability in Software Systems [55]. A software system is said to be observable if it is possible As discussed in Section II-B, an application may be per- to analyze the system’s internal state on the basis of its turbed by both hardware errors and software errors. Sometimes external behavior [40]. For example, by observing an HTTP such errors can not be handled by the operating system on its 500 response code instead of 200, developers are able to know own. Thus it propagates the error to the application, by failing that there are some errors in the system when handling an a system call invocation with an error code. HTTP request. In the context of errors, observability relates to the ability of detecting when an error naturally occurs in a III. DESIGN &IMPLEMENTATION OF PHOEBE software system. In this case, observability helps developers to Now we describe PHOEBE, a system for defining and in- evaluate the system’s error detection and handling capabilities. jecting realistic system call perturbations in software systems. There are mainly three categories of observability tech- niques [53]: 1) logging the system’s internal state. For ex- A. Objective of PHOEBE ample, using Exception.printStackTrace() method in Java to log stack information when an exception occurs. Applications contain a large amount of code for handling 2) monitoring metrics exposed by a system. For example, system call invocation errors [49]. Some system call invocation monitoring the memory usage of an application in order to de- errors happening in production do not have an visible impact tect memory leak issues. and 3) tracing externally observable on the application’s behavior, because the resilience mecha- events. For example, tracing an HTTP request that propagates nisms embedded in the application handle them properly [41]. through micro-services for timeout-related bug analysis. However, it is not possible for developers to know whether these fault-tolerance mechanisms behave properly if these errors were to occur more frequently. B. Chaos Engineering If a given type of system call invocation error rarely Chaos engineering is a recent fault injection technique, happens, it may take a long time to observe the corresponding which consists of injecting faults in production in order to abnormal behavior. Similarly, even if a type of system call verify actual reliability [8]. Different from traditional fault error happens very frequently, it is not clear whether there injection techniques, chaos engineering focuses on evaluating is a threshold on the error rate below which the application systems while they are deployed into production. This is nec- can handle the errors and beyond which it starts to behave essary for large-scale software systems because they are practi- abnormally. The key problem addressed by PHOEBE is to cally impossible to set up realistically in a testing environment. evaluate whether system call errors can have a severe impact Chaos engineering usually builds the fault injection model on the application’s behavior if the error rate is higher than based on real events (such as server crashes). Since faults are usual. injected in production, special techniques are employed so as In order to bring insights about how an application behaves to keep side effects acceptable.