Principles of Antifragile Software
Total Page:16
File Type:pdf, Size:1020Kb
Principles of Antifragile Software Martin Monperrus University of Lille & Inria, France [email protected] January 27, 2017 Abstract reliability. The notion of “antifragility” comes from the There are many software engineering concepts book by Nassim Nicholas Taleb simply entitled and techniques related to software errors. But “Antifragile” [14]. Antifragility is a property of is this enough? Have we already completely ex- systems, whether natural or artificial: a system plored the software engineering noosphere with is antifragile if it thrives and improves when fac- respect to errors and reliability? In this paper, ing errors. Taleb has a broad definition of “error”: I discuss an novel concept, called “software an- it can be volatility (e.g. for financial systems), tifragility”, that is unconventional and has the attacks and shocks (e.g. for immune systems), capacity to improve the way we engineer errors death (e.g. for human systems), etc. Yet, Taleb’s and dependability in a disruptive manner. This essay is not at all about engineering, and it re- paper first discusses the foundations of software mains to translate the power and breadth of his antifragilty, from classical fault tolerance to the vision into a set of sound engineering principles. most recent advances on automatic software re- This paper provides a first step in this direction pair and fault injection in production. This pa- and discusses the relations between traditional per then explores the relation between the an- software engineering concepts and antifragility. tifragility of the development process and the First, I relate software antifragility to classi- antifragility of the resulting software product. cal fault tolerance. Second, I show the link be- tween antifragility and the most recent advances on automatic software repair and fault injection. 1 Introduction Third, I explore the relation between the an- tifragility of the development process and the The software engineering body of knowledge on antifragility of the resulting software product. software errors and reliability is not short of con- This paper is a revised version of an Arxiv cepts, starting from the classical definitions of paper [9]. faults, errors and failures [1], continuing with the techniques for fault-freeness proofs, fault removal and fault tolerance, etc. But is this enough? 2 Software Fragility Have we already completely explored the space of software engineering concepts related to er- There are many pieces of evidence of software rors? In this paper, I discuss a novel concept, fragility, sometimes referred to as “software brit- that I call “software antifragility”, which as the tleness”, [13]. For instance, the inaugural flight capacity to radically change the way we reason of Ariane 5 ended up with the total destruction about software errors and the way we engineer of the rocket, because of an overflow in a sub- 1 component of the system. At a totally different plexity also naturally suffer from errors, as com- scale, in the Eclipse development environment, a plex biological and ecological systems do. Once single external plugin of a low-level library pro- one acknowledges the necessary existence of soft- viding optional features crashes the whole sys- ware errors in large and interconnected software tem and makes it unusable (this is a recent exam- systems[13, 10], it changes the game. ple of fragility from December 20131). Software fragility seems independent of scale, domain and 3.1 Fault-tolerance and An- implementation technology. There are means to combat fragility: fault pre- tifragility vention, fault tolerance, fault removal, and fault Instead of aiming at error-free software, there are forecasting [1]. Software engineers strive for de- software engineering techniques to constantly de- pendability, they do their best to prevent, de- tect errors in production (aka self-checking soft- tect and repair errors. They prevent bugs by ware [18]) and to tolerate them as well (aka fault following best practices, They detect bugs by ex- tolerance [11]). Self-checking and self-testing and tensively testing them and comparing the imple- fault-tolerance is not loving errors literally, but mentation against the specification, They repair it is an interesting first step. bugs reported by testers or users and ship the In Taleb’s view, a key point of antifragility is fixes in the next release. However, despite those that an antifragile system becomes better and efforts, most software remains fragile. There are stronger under continuous attacks and errors. pragmatic explanations to this fragility: lack of The immune system, for instance, has this prop- education, technical debs in legacy systems, or erty: it requires constant pressure from microbes the economic pressure for writing cheap code. to stay reactive. Self-detection of bugs is not an- However, I think that the reason is more fun- tifragile, software may detect a lot of erroneous damental: we do not take the right perspective states, but it would not make it detect more. on errors. For fault tolerance, the frontier blurs. If the fault tolerance mechanism is static there is no advantage from having more faults. If the fault 3 Software Antifragility tolerance mechanism is adaptive [6] and if some- thing is learned when an error happens, the sys- As Taleb puts it, an antifragile system “loves er- tem always improves. We hit here a first charac- rors”. Software engineers do not. First, errors teristic of software antifragility. A software sys- cost money: it is time-consuming to find and to tem with dynamic, adaptive fault tolerance capa- repair bugs. Second, they are unpredictable: one bilities is antifragile: exposed to faults, it contin- can hardly forecast when and where they will oc- uously improves. cur, one can not precisely estimate the difficulty of repairing them. Software errors are tradition- ally considered as a plague to be eradicated and 3.2 Automatic Runtime Bug Re- this is the problem. pair Possibly, instead of damning errors, one can Fault removal, i.e. bug repair, is one means to see them as an intrinsic characteristic of the sys- attain reliability [1]. Let us now consider soft- tems we build. Complex systems have errors: in ware that repairs its own bugs at runtime and biological systems, errors constantly occur: DNA call the corresponding body of techniques “au- pairs are not properly copied, cells mutate, etc. tomatic runtime repair” (also called “automatic Software systems of reasonable size and com- recovery” and also “self-healing” [7]). 1https://bugs.eclipse.org/bugs/show_bug.cgi? There are two kinds of automatic software re- id=334466 pair: state repair and behavioral repair [8]. State 2 repair consists in modifying a program’s state system’s error recovery capabilities; if the sys- during its execution (the registers, the heap, the tem can handle those injected faults, it is likely stack, etc.). Demsky and Rinard’s paper on data to handle real-world natural faults of the same structure repair [4] is an example of such state re- nature. Third, monitoring the impact of each in- pair. Behavioral repair consists in modifying the jection gives the opportunity to learn something program behavior, with runtime patches. The on the system itself and the real environmental patch, whether binary or source, is synthesized conditions. and applied at runtime, with no human in the Because of these three effects, injecting faults loop. For instance, the application communities in production makes the system better. This of Locasto and colleagues [7] share behavioral corresponds to the main characteristic of an- patches for repairing faults in C code. tifragility: “the antifragile loves error”. It is not As said previously, a software system can be purely the injected faults that improve the sys- considered as antifragile as long as it learns some- tem, it is the impact of injected faults on the thing from bugs that occur. Automatic runtime engineering ecosystem (the design principles, the bug repair at the behavioral level corresponds mindset of engineers, etc). I will come back on to antifragility, since each fixed bug results in the profound relation between product and pro- a change in the code, in a better system. This cess in Section 4. A software system using fault means “loving errors”: a software system with self-injection in production is antifragile, it de- runtime bug repair capabilities loves errors be- creases the risk of missing, or incorrect or rotting cause those errors continuously trigger improve- of error-handling code by continuously exercising ments of the system itself. it. Injecting faults in production must come with 3.3 Failure Injection in Production a careful analysis of the the dependability losses. There must be a balance between the depend- If you really “love errors”, you always want more ability losses (due to injected system failures) of them. In software, one can create artificial and the dependability gains (due to software im- errors using techniques called fault and failure provements) that result from using fault injec- injection. So, literally, software that “loves er- tion in production. Measuring this tradeoff is rors” would continuously self-injects faults and the key point of antifragile software engineering. perturbations. Would it make sense? The idea of fault injection in production is un- By self-injecting failures, a software system conventional but not new. In 1975, Yau and Che- constantly exercises its error-recovery capabili- ung [18] proposed inserting fake “ghost planes” ties. If the system resists those injected failures, in an air traffic control system. If all the ghost it will likely resist similar real-world failures. For planes land safely while interacting with the sys- instance, in a distributed system, servers may tem and human operators, one can really trust crash or be disconnected from the rest of the net- the system. Recently, a company named Netflix work. In a distributed system with fault injec- released a “simian army” [5, 2], whose different tion, a fault injector may randomly crash some kinds of monkeys inject faults in their services servers (an example of such an injector is the and datacenters.