Principles of Antifragile Software

Martin Monperrus
University of Lille & Inria, France
[email protected]

January 27, 2017

Abstract

There are many software concepts and techniques related to software errors. But is this enough? Have we already completely explored the software engineering noosphere with respect to errors and reliability? In this paper, I discuss a novel concept, called "software antifragility", that is unconventional and has the capacity to improve the way we engineer errors and dependability in a disruptive manner. This paper first discusses the foundations of software antifragility, from classical fault tolerance to the most recent advances on automatic software repair and fault injection in production. This paper then explores the relation between the antifragility of the development process and the antifragility of the resulting software product.

1 Introduction

The software engineering body of knowledge on software errors and reliability is not short of concepts, starting from the classical definitions of faults, errors and failures [1], continuing with the techniques for fault-freeness proofs, fault removal and fault tolerance, etc. But is this enough? Have we already completely explored the space of software engineering concepts related to errors? In this paper, I discuss a novel concept, that I call "software antifragility", which has the capacity to radically change the way we reason about software errors and the way we engineer reliability. The notion of "antifragility" comes from the book by Taleb simply entitled "Antifragile" [14]. Antifragility is a property of systems, whether natural or artificial: a system is antifragile if it thrives and improves when facing errors. Taleb has a broad definition of "error": it can be volatility (e.g. for financial systems), attacks and shocks (e.g. for immune systems), death (e.g. for human systems), etc. Yet, Taleb's essay is not at all about engineering, and it remains to translate the power and breadth of his vision into a set of sound engineering principles. This paper provides a first step in this direction and discusses the relations between traditional software engineering concepts and antifragility.

First, I relate software antifragility to classical fault tolerance. Second, I show the link between antifragility and the most recent advances on automatic software repair and fault injection. Third, I explore the relation between the antifragility of the development process and the antifragility of the resulting software product. This paper is a revised version of an Arxiv paper [9].

2 Software Fragility

There are many pieces of evidence of software fragility, sometimes referred to as "software brittleness" [13]. For instance, the inaugural flight of Ariane 5 ended up with the total destruction of the rocket, because of an overflow in a subcomponent of the system. At a totally different scale, in the Eclipse development environment, a single external plugin of a low-level library providing optional features crashes the whole system and makes it unusable (a recent example of fragility from December 2013, see https://bugs.eclipse.org/bugs/show_bug.cgi?id=334466). Software fragility seems independent of scale, domain and implementation technology.

There are means to combat fragility: fault prevention, fault tolerance, fault removal, and fault forecasting [1]. Software engineers strive for dependability; they do their best to prevent, detect and repair errors. They prevent bugs by following best practices, they detect bugs by extensively testing the implementation against the specification, and they repair bugs reported by testers or users and ship the fixes in the next release. However, despite those efforts, most software remains fragile. There are pragmatic explanations to this fragility: lack of education, technical debt in legacy systems, or the economic pressure for writing cheap code. However, I think that the reason is more fundamental: we do not take the right perspective on errors.

3 Software Antifragility

As Taleb puts it, an antifragile system "loves errors". Software engineers do not. First, errors cost money: it is time-consuming to find and repair bugs. Second, they are unpredictable: one can hardly forecast when and where they will occur, and one cannot precisely estimate the difficulty of repairing them. Software errors are traditionally considered as a plague to be eradicated, and this is the problem.

Possibly, instead of damning errors, one can see them as an intrinsic characteristic of the systems we build. Complex systems have errors: in biological systems, errors constantly occur: DNA pairs are not properly copied, cells mutate, etc. Software systems of reasonable size and complexity also naturally suffer from errors, as complex biological and ecological systems do. Once one acknowledges the necessary existence of software errors in large and interconnected software systems [13, 10], it changes the game.

3.1 Fault-tolerance and Antifragility

Instead of aiming at error-free software, there are software engineering techniques to constantly detect errors in production (aka self-checking software [18]) and to tolerate them as well (aka fault tolerance [11]). Self-checking, self-testing and fault tolerance are not literally "loving errors", but they are an interesting first step.

In Taleb's view, a key point of antifragility is that an antifragile system becomes better and stronger under continuous attacks and errors. The immune system, for instance, has this property: it requires constant pressure from microbes to stay reactive. Self-detection of bugs is not antifragile in itself: software may detect many erroneous states, but that does not make it detect more of them over time. For fault tolerance, the frontier blurs. If the fault tolerance mechanism is static, there is no advantage in having more faults. If the fault tolerance mechanism is adaptive [6] and something is learned when an error happens, the system always improves. We hit here a first characteristic of software antifragility: a software system with dynamic, adaptive fault tolerance capabilities is antifragile; exposed to faults, it continuously improves.
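To make this first characteristic more concrete, consider the following minimal sketch in Python (the class and its names are hypothetical, an illustration rather than an existing library) of an adaptive tolerance mechanism: a guard that retries an unreliable operation and, unlike a static retry policy, records every error it observes and adapts its backoff accordingly.

    # Illustrative sketch only (hypothetical names), not an existing library:
    # a guard that tolerates failures and learns from every error it observes.
    import time
    from collections import Counter

    class AdaptiveGuard:
        def __init__(self, max_retries=3):
            self.max_retries = max_retries
            self.error_stats = Counter()   # how often each kind of error has occurred
            self.backoff = {}              # learned per-error-kind backoff, in seconds

        def call(self, operation, *args, **kwargs):
            last_error = None
            for _ in range(self.max_retries):
                try:
                    return operation(*args, **kwargs)
                except Exception as error:
                    # Learning step: each tolerated error updates the policy,
                    # so the guard behaves differently the next time.
                    last_error = error
                    kind = type(error).__name__
                    self.error_stats[kind] += 1
                    self.backoff[kind] = min(2 ** self.error_stats[kind], 30)
                    time.sleep(self.backoff[kind])
            # After max_retries the error is propagated; a real system would
            # switch to a degraded mode instead of failing.
            raise last_error

Each tolerated error changes the guard's future behavior, which is the minimal sense in which exposure to faults improves the mechanism; a static retry loop would not have this property.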
3.2 Automatic Runtime Bug Repair

Fault removal, i.e. bug repair, is one means to attain reliability [1]. Let us now consider software that repairs its own bugs at runtime and call the corresponding body of techniques "automatic runtime repair" (also called "automatic recovery" and "self-healing" [7]).

There are two kinds of automatic software repair: state repair and behavioral repair [8]. State repair consists in modifying a program's state during its execution (the registers, the heap, the stack, etc.). Demsky and Rinard's paper on data structure repair [4] is an example of such state repair. Behavioral repair consists in modifying the program behavior, with runtime patches. The patch, whether binary or source, is synthesized and applied at runtime, with no human in the loop. For instance, the application communities of Locasto and colleagues [7] share behavioral patches for repairing faults in C code.

As said previously, a software system can be considered antifragile as long as it learns something from the bugs that occur. Automatic runtime bug repair at the behavioral level corresponds to antifragility, since each fixed bug results in a change in the code, hence in a better system. This means "loving errors": a software system with runtime bug repair capabilities loves errors because those errors continuously trigger improvements of the system itself.
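As a toy illustration of the first kind of repair (purely didactic, not Demsky and Rinard's actual technique), the following Python sketch shows a self-checking data structure that detects a violated invariant during execution and restores a consistent state instead of failing:

    # Toy illustration of runtime *state* repair (didactic sketch, not the
    # technique of Demsky and Rinard): detect a broken invariant, rebuild
    # a consistent state at runtime, and keep running.
    class SortedList:
        """Invariant: items are kept in ascending order, without duplicates."""

        def __init__(self):
            self.items = []

        def insert(self, value):
            # Deliberately buggy: the invariant is not maintained on insertion.
            self.items.append(value)

        def check_and_repair(self):
            repaired = sorted(set(self.items))
            if repaired != self.items:
                self.items = repaired
                return True   # an error occurred and was repaired; report it, learn from it
            return False

    # The structure keeps working despite the buggy insert.
    lst = SortedList()
    for value in [3, 1, 3, 2]:
        lst.insert(value)
        lst.check_and_repair()
    print(lst.items)   # [1, 2, 3]

A behavioral repair would go one step further and change the faulty insert itself, for instance through a runtime patch, so that the same error does not have to be repaired again and again.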
3.3 Failure Injection in Production

If you really "love errors", you always want more of them. In software, one can create artificial errors using techniques called fault and failure injection. So, literally, software that "loves errors" would continuously self-inject faults and perturbations. Would it make sense?

By self-injecting failures, a software system constantly exercises its error-recovery capabilities. If the system resists those injected failures, it will likely resist similar real-world failures. For instance, in a distributed system, servers may crash or be disconnected from the rest of the network. In a distributed system with fault injection, a fault injector may randomly crash some servers (an example of such an injector is the Chaos Monkey [2]).

Ensuring the occurrence of faults has three positive effects on the system. First, it forces engineers to think of error recovery as a first-class engineering element: the system must at least be able to resist the injected faults. Second, it gives engineers and users confidence about the system's error-recovery capabilities; if the system can handle those injected faults, it is likely to handle real-world natural faults of the same kind. Third, monitoring the impact of each injection gives the opportunity to learn something about the system itself and the real environmental conditions.

Because of these three effects, injecting faults in production makes the system better. This corresponds to the main characteristic of antifragility: "the antifragile loves errors". It is not purely the injected faults that improve the system, it is the impact of injected faults on the engineering ecosystem (the design principles, the mindset of engineers, etc.). I will come back to the profound relation between product and process in Section 4. A software system using fault self-injection in production is antifragile: it decreases the risk of missing, incorrect or rotting error-handling code by continuously exercising it.

Injecting faults in production must come with a careful analysis of the dependability losses. There must be a balance between the dependability losses (due to injected system failures) and the dependability gains (due to software improvements) that result from using fault injection in production. Measuring this tradeoff is the key point of antifragile software engineering.

The idea of fault injection in production is unconventional but not new. In 1975, Yau and Cheung [18] proposed inserting fake "ghost planes" in an air traffic control system. If all the ghost planes land safely while interacting with the system and human operators, one can really trust the system. Recently, a company named Netflix released a "simian army" [5, 2], whose different kinds of monkeys inject faults in their services and datacenters. For instance, the "Chaos Monkey" randomly crashes some production servers, and the "Latency Monkey" arbitrarily increases and decreases the latency in the server network. They call this practice "chaos engineering".

From 1975 to today, the idea of fault injection in production has remained almost invisible. Automated fault injection in production has rather been overlooked so far (the concept is not mentioned in the cornerstone paper by Avizienis, Laprie and Randell [1]). However, the nascent chaos engineering community may signal a real shift.
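To fix ideas, the following Python sketch shows what fault self-injection could look like at the level of a single service call (the class and its parameters are hypothetical, and this is of course far simpler than Netflix's actual tooling): with a small probability, a call is made to fail or to slow down, so that the surrounding error-handling code is continuously exercised.

    # Hypothetical, illustrative sketch of fault self-injection; far simpler
    # than the actual simian army tooling.
    import random
    import time

    class FaultInjector:
        def __init__(self, failure_rate=0.01, max_extra_latency=2.0, enabled=True):
            self.failure_rate = failure_rate            # probability of injecting a fault
            self.max_extra_latency = max_extra_latency  # worst injected delay, in seconds
            self.enabled = enabled                      # production kill-switch
            self.injections = 0                         # monitored, to learn from each injection

        def wrap(self, service_call):
            def wrapped(*args, **kwargs):
                if self.enabled and random.random() < self.failure_rate:
                    self.injections += 1
                    if random.random() < 0.5:
                        # crash-like fault ("Chaos Monkey"-style)
                        raise ConnectionError("injected failure")
                    # latency fault ("Latency Monkey"-style)
                    time.sleep(random.uniform(0.0, self.max_extra_latency))
                return service_call(*args, **kwargs)
            return wrapped

The value lies less in the injector itself than in what surrounds it: the calling code must be written to survive the injected errors, the kill-switch and the failure rate embody the balance between dependability losses and gains discussed above, and monitoring the injection counter against user-visible failures is what turns each injected fault into something learned about the system.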
4 Software Development Process Antifragility

On the one hand, there is the software, the product, and on the other hand there is the process that builds the product. In Taleb's view, antifragility is a concept that also applies to processes. For instance, he says that the Silicon Valley innovation process is quite antifragile, because it deeply admits errors, and both inventors and investors know that many startups will eventually fail. I now discuss the antifragility aspect of the software development process.

4.1 Test-driven Development

In test-driven development, developers write automated tests for each feature they write. When a bug is found, a test that reproduces the bug is first written; then the bug is fixed. The resulting strength of the test suite gives developers much confidence in the ability of their code to resist changes. Concretely, this confidence enables them to put "refactoring" as a key phase of development. Since developers have an aid (the test suite) to assess the correctness of their software, they can continuously refine the design or the implementation. They refactor fearlessly, with little risk that a breaking change will go unnoticed. Furthermore, test-driven development allows continuous deployment, as opposed to long release cycles. Continuous deployment means that features and bug fixes are released in production in a daily manner (and sometimes several times a day). It is the trust given by automated tests that allows continuous deployment.
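As a small illustration of this practice (the function and the bug are hypothetical), the test that reproduces a reported bug is written first, fails, and then stays in the suite to guard future refactorings once the fix is applied:

    # Hypothetical example of test-driven bug fixing: the regression test is
    # written first from the bug report, then the fix is applied.
    import unittest

    def parse_price(text):
        """Parse a price such as '12.50 EUR' into a float (hypothetical function)."""
        # The fix: strip whitespace and drop the currency suffix before parsing.
        # Before the fix, ' 12.50 EUR' raised ValueError, as captured by the test below.
        return float(text.strip().split()[0])

    class PriceRegressionTest(unittest.TestCase):
        def test_price_with_currency_and_spaces(self):
            # Written first, from the bug report, before any fix was attempted.
            self.assertEqual(parse_price(" 12.50 EUR"), 12.50)

    if __name__ == "__main__":
        unittest.main()

Such regression tests accumulate: every error that has occurred once leaves behind an automated check, which is part of what gives developers the trust needed for fearless refactoring and continuous deployment.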

What is interesting with test-driven development is the second-order effect. With continuous deployment, errors have smaller impacts. No massive groups of interacting features and fixes arrive in production at the same time. When an error is found in production, the new version can be released very quickly, before any catastrophic propagation. Also, when an error is found in production, it applies to a version that is close to the most recent version of the software product (the "HEAD" version). Fixing an error in HEAD is usually much easier than fixing an error in a past version, because the patch can seamlessly be applied to all close versions, and because the developers usually have the latest version in mind. Both properties (ease of deployment, ease of fixing) contribute to minimizing the effects of errors. We recognize here a property of antifragility as Taleb puts it: "If you want to become antifragile, put yourself in the situation 'loves errors' [...] by making these numerous and small in harm" (Taleb [14]).

4.2 Bus Factor

In software development, the "bus factor" measures to what extent people are essential to a project. If a key developer is hit by a bus (or anything similar in effect), could it bring the whole project down? In dependability terms, such a consequence means that there is a failure propagation from a minor issue to a catastrophic effect. There are management practices to cope with this critical risk. For instance, one technique is to regularly move people from project to project, so that nobody concentrates essential knowledge. At one extreme is "If a programmer is indispensable, get rid of him as quickly as possible" [17].

In the short term, moving people is sub-optimal. From a people perspective, they temporarily lose some productivity when they join a new project, in order to learn a new set of techniques, conventions, and communication patterns. They will often feel frustrated and unhappy because of this. From a project perspective, when a developer leaves, the project experiences a small slow-down. The slow-down lasts until the rest of the team grasps the knowledge and know-how of the developer who has just left. However, from a long-term perspective, it decreases the bus factor. In other terms, moving people transforms rare and irreversible large errors (project failure) into lots of small errors (productivity loss, slow-down). This is again antifragile.

4.3 Conway's Law

In programming, Conway's law states that "organizations which design systems [...] are constrained to produce designs which are copies of the communication structures of these organizations" [3]. Raymond famously put this as "If you have four groups working on a compiler, you'll get a 4-pass compiler" [12].

More generally, the engineering process has an impact on the product architecture and properties. In other terms, some properties of a system emerge from the process employed to build it. Since antifragility is a property, there may be software development processes that hinder antifragility in the resulting software and others that foster it. The latter would be "antifragile software engineering". I tend to think that the engineers who set up antifragile processes know the nature of errors better than others. I believe that developers enrolled in an antifragile process become imbued with some of the values of antifragility. Tseitlin's concept of "antifragile organizations" is along the same line [15]. Because of this, I hypothesize that antifragile software development processes are better at producing antifragile software systems.

5 Conclusion

This is only the beginning of antifragile software engineering. Beyond the vision presented here, research now has to devise sound engineering principles and techniques regarding self-checking, self-repair and fault injection in production. Because of the amount of legacy software, a major research avenue is to invent ways to develop antifragile software on top of existing brittle programming languages and execution environments. That would be a 21st-century echo of von Neumann's dream of building reliable systems from unreliable components [16].

References

[1] A. Avizienis, J.-C. Laprie, B. Randell, et al. Fundamental concepts of dependability. Technical report, University of Newcastle upon Tyne, 2001.

[2] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal. Chaos engineering. IEEE Software, 33(3):35-41, 2016.

[3] M. E. Conway. How do committees invent? Datamation, 14(4):28-31, 1968.

[4] B. Demsky and M. Rinard. Automatic detection and repair of errors in data structures. ACM SIGPLAN Notices, 38(11):78-95, 2003.

[5] Y. Izrailevsky and A. Tseitlin. The Netflix simian army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, 2011.

[6] Z. T. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant. Chameleon: A software infrastructure for adaptive fault tolerance. IEEE Transactions on Parallel and Distributed Systems, 10(6):560-579, 1999.

[7] M. E. Locasto, S. Sidiroglou, and A. D. Keromytis. Software self-healing using collaborative application communities. In Proceedings of the Symposium on Network and Distributed Systems Security, 2006.

[8] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the International Conference on Software Engineering, 2014.

[9] M. Monperrus. Principles of antifragile software. Technical Report 1404.3056, Arxiv, 2014.

[10] H. Petroski. To Engineer is Human: The Role of Failure in Successful Design. Vintage Books, 1992.

[11] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220-232, June 1975.

[12] E. S. Raymond et al. The jargon file. http://catb.org/jargon/, last accessed Jan. 2014.

[13] M. Shaw. Self-healing: softening precision to avoid brittleness. In Proceedings of the First Workshop on Self-Healing Systems, 2002.

[14] N. N. Taleb. Antifragile. Random House, 2012.

[15] A. Tseitlin. The antifragile organization. Commun. ACM, 56(8):40-44, Aug. 2013.

[16] J. von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 1956.

[17] G. M. Weinberg. The psychology of computer programming. Van Nostrand Reinhold, New York, 1971.

[18] S. Yau and R. Cheung. Design of self-checking software. In ACM SIGPLAN Notices, volume 10, pages 450-455. ACM, 1975.
