Making Reliable Distributed Systems in the Presence of Software Errors. Final Version (With Corrections), Last Update 20 November 2003
Joe Armstrong

A Dissertation submitted to the Royal Institute of Technology in partial fulfilment of the requirements for the degree of Doctor of Technology

The Royal Institute of Technology
Stockholm, Sweden
December 2003

Department of Microelectronics and Information Technology

TRITA–IMIT–LECS AVH 03:09
ISSN 1651–4076
ISRN KTH/IMIT/LECS/AVH-03/09–SE
and
SICS Dissertation Series 34
ISSN 1101–1335
ISRN SICS–D–34–SE

© Joe Armstrong, 2003
Printed by Universitetsservice US-AB 2003

To Helen, Thomas and Claire

Abstract

The work described in this thesis is the result of a research program started in 1981 to find better ways of programming Telecom applications. These applications are large programs which, despite careful testing, will probably contain many errors when the program is put into service. We assume that such programs do contain errors, and investigate methods for building reliable systems despite such errors.

The research has resulted in the development of a new programming language (called Erlang), together with a design methodology and a set of libraries for building robust systems (called OTP). At the time of writing the technology described here is used in a number of major Ericsson and Nortel products. A number of small companies have also been formed which exploit the technology.

The central problem addressed by this thesis is the problem of constructing reliable systems from programs which may themselves contain errors. Constructing such systems imposes a number of requirements on any programming language that is to be used for the construction. I discuss these language requirements, and show how they are satisfied by Erlang.

Problems can be solved in a programming language, or in the standard libraries which accompany the language.
I argue that certain of the requirements necessary to build a fault-tolerant system are solved in the language, while others are solved in the standard libraries. Together these form a basis for building fault-tolerant software systems.

No theory is complete without proof that the ideas work in practice. To demonstrate that these ideas work in practice, I present a number of case studies of large commercially successful products which use this technology. At the time of writing the largest of these projects is a major Ericsson product, having over a million lines of Erlang code. This product (the AXD301) is thought to be one of the most reliable products ever made by Ericsson.

Finally, I ask if the goal of finding better ways to program Telecom applications was fulfilled, and I point to areas where I think the system could be improved.

Contents

Abstract
1 Introduction
  1.1 Background
    Ericsson background
    Chronology
  1.2 Thesis outline
    Chapter by chapter summary
2 The Architectural Model
  2.1 Definition of an architecture
  2.2 Problem domain
  2.3 Philosophy
  2.4 Concurrency oriented programming
    2.4.1 Programming by observing the real world
    2.4.2 Characteristics of a COPL
    2.4.3 Process isolation
    2.4.4 Names of processes
    2.4.5 Message passing
    2.4.6 Protocols
    2.4.7 COP and programmer teams
  2.5 System requirements
  2.6 Language requirements
  2.7 Library requirements
  2.8 Application libraries
  2.9 Construction guidelines
  2.10 Related work
3 Erlang
  3.1 Overview
  3.2 Example
  3.3 Sequential Erlang
    3.3.1 Data structures
    3.3.2 Variables
    3.3.3 Terms and patterns
    3.3.4 Guards
    3.3.5 Extended pattern matching
    3.3.6 Functions
    3.3.7 Function bodies
    3.3.8 Tail recursion
    3.3.9 Special forms
    3.3.10 case
    3.3.11 if
    3.3.12 Higher order functions
    3.3.13 List comprehensions
    3.3.14 Binaries
    3.3.15 The bit syntax
    3.3.16 Records
    3.3.17 epp
    3.3.18 Macros
    3.3.19 Include files
  3.4 Concurrent programming
    3.4.1 register
  3.5 Error handling
    3.5.1 Exceptions
    3.5.2 catch
    3.5.3 exit
    3.5.4 throw
    3.5.5 Corrected and uncorrected errors
    3.5.6 Process links and monitors
  3.6 Distributed programming
  3.7 Ports
  3.8 Dynamic code change
  3.9 A type notation
  3.10 Discussion
4 Programming Techniques
  4.1 Abstracting out concurrency
    4.1.1 A fault-tolerant client-server
  4.2 Maintaining the Erlang view of the world
  4.3 Error handling philosophy
    4.3.1 Let some other process fix the error
    4.3.2 Workers and supervisors
  4.4 Let it crash
  4.5 Intentional programming
  4.6 Discussion
5 Programming Fault-tolerant Systems
  5.1 Programming fault-tolerance
  5.2 Supervision hierarchies
    5.2.1 Diagrammatic representation
    5.2.2 Linear supervision
    5.2.3 And/or supervision hierarchies
  5.3 What is an error?
    5.3.1 Well-behaved functions
6 Building an Application
  6.1 Behaviours
    6.1.1 How behaviours are written
  6.2 Generic server principles
    6.2.1 The generic server API
    6.2.2 Generic server example
  6.3 Event manager principles
    6.3.1 The event manager API
    6.3.2 Event manager example
  6.4 Finite state machine principles
    6.4.1 Finite state machine API
    6.4.2 Finite state machine example
  6.5 Supervisor principles
    6.5.1 Supervisor API
    6.5.2 Supervisor example
  6.6 Application principles
    6.6.1 Applications API
    6.6.2 Application example
  6.7 Systems and releases
  6.8 Discussion
7 OTP
  7.1 Libraries
8 Case Studies
  8.1 Methodology
  8.2 AXD301
  8.3 Quantitative properties of the software
    8.3.1 System Structure
    8.3.2 Evidence for fault recovery
    8.3.3 Trouble report HD90439
    8.3.4 Trouble report HD29758
    8.3.5 Deficiencies in OTP structure
  8.4 Smaller products
    8.4.1 Bluetail Mail Robustifier
    8.4.2 Alteon SSL accelerator
    8.4.3 Quantitative properties of the code
  8.5 Discussion
9 APIs and Protocols
  9.1 Protocols
  9.2 APIs or protocols?
  9.3 Communicating components
  9.4 Discussion
10 Conclusions
  10.1 What has been achieved so far?
  10.2 Ideas for future work
    10.2.1 Conceptual integrity
    10.2.2 Files and bang bang
    10.2.3 Distribution and bang bang
    10.2.4 Spawning and bang bang
    10.2.5 Naming of processes
    10.2.6 Programming with bang bang
  10.3 Exposing the interface - discussion
  10.4 Programming communicating components
A Acknowledgments
B Programming Rules and Conventions
C UBF
D Colophon
References

1 Introduction

How can we program systems which behave in a reasonable manner in the presence of software errors? This is the central question that I hope to answer in this thesis. Large systems will probably always be delivered containing a number of errors in the software; nevertheless, such systems are expected to behave in a reasonable manner.

To make a reliable system from faulty components places certain requirements on the system. The requirements can be satisfied either in the programming language which is used to solve the problem, or in the standard libraries which are called by the application programs to solve the problem.

In this thesis I identify the essential characteristics which I believe are necessary to build fault-tolerant software systems. I also show how these characteristics are satisfied in our system.
Some of the essential characteristics are satisfied in our programming language (Erlang), others are satisfied in library modules written in Erlang. Together the language and libraries form a basis for building reliable software systems which function in an adequate manner even in the presence of programming errors.

Having said what my thesis is about, I should also say what it is not about. The thesis does not cover in detail many of the algorithms used as building blocks for constructing fault-tolerant systems; it is not the algorithms themselves which are the concern of this thesis, but rather the programming language in which such algorithms are expressed. I am also not concerned with hardware aspects of building fault-tolerant systems, nor with the software engineering aspects of fault-tolerance.

The concern is with the language, libraries and operating system requirements for software fault-tolerance. Erlang belongs to the family of pure message passing languages.