


Performance Studies of Fault-Tolerant Middleware

Diana Szentiványi

Abstract

Today's engineering and application development trend is to take advantage of reusable software. Much effort is directed towards easing the task of developing complex, distributed, network based applications with reusable components. To ease the task of distributed systems' developers, one can use middleware, i.e. a software layer between the operating system and the application, which handles distribution transparently. A crucial feature of distributed server applications is high availability. This implies that they must be able to continue activity even in the presence of crashes. Embedding fault tolerance mechanisms in the middleware on top of which the application is running offers the potential to reduce application code size, thereby reducing developer effort. Also, outage times due to server crashes can be reduced, as failover is taken care of automatically by the middleware. However, a trade-off is involved: during periods with no failures, as information has to be collected for the automatic failover, client requests are serviced with higher latency. To characterize the real benefits of fault-tolerant middleware, this trade-off needs to be studied. Unfortunately, to this date, few trade-off studies involving middleware that supports fault tolerance with application to realistic cases have been conducted.

The contributions of the thesis are twofold: (1) insights based on empirical studies and (2) a theoretical analysis of components in a middleware equipped with fault tolerance mechanisms. In connection with part (1) the thesis describes the detailed implementation of two platforms based on CORBA (Common Object Request Broker Architecture) with fault tolerance capabilities: one built by following the FT-CORBA standard, where only application failures are taken care of, and a second obtained by implementing an algorithm that ensures uniform treatment of infrastructure and application failures. Based on empirical studies of the availability/performance trade-off, several insights were gained, including the benefits and drawbacks of the two infrastructures. The studies were performed using a realistic (telecommunication) application set up to run on top of both extended middleware platforms. Further, the thesis proposes a technique to improve performance in the FT-CORBA based middleware by exploiting application knowledge; to enrich application code with fault tolerance mechanisms we use aspect-oriented programming. In connection with part (2) the thesis models elements of an FT-CORBA like architecture mathematically, in particular by using queuing theory. The model is then used to study the relation between different parameters. This provides the means to configure one middleware parameter, namely the checkpointing interval, leading to maximal availability or minimal response time.

This work has been supported by the European Project TRANSORG in the IST initiative EUTIST-AMI, the FP6 IST project DeDiSys on Dependable Distributed Systems, and by CENIIT (Center for Industrial Information Technology) at Linköping University.

Acknowledgments

First and foremost I would like to thank my advisor Simin Nadjm-Tehrani for always challenging me to find new ideas. Without her this work would not have been accomplished. By letting me stand on my own feet in the last two years she contributed to my growth and my self-knowledge. I also want to thank my co-advisor Petru Eles for reading this thesis and giving valuable feedback.

I thank John M. Noble, our collaborator from the Mathematics Department, for being a real help in the work on the checkpointing interval optimization. Without him, that work could not have been done. It was really rewarding to work with mathematics in a real-life context.

I want to thank Nick Szirbik, over and over again, for proposing to me the idea of doing PhD studies abroad. Since the summer of 1998 my life has changed completely.

My colleagues in RTSLAB offered a nice working environment with nice discussions. I especially thank Anne Moe for always sorting out difficult administrative questions in an efficient way. It will be difficult to leave them behind.

My thanks go to Johan Moe for priceless discussions around the platform implementations and evaluations. I thank Isabelle Ravot for implementing the FA-CORBA platform during her visit to RTSLAB in 2002.

I thank Torbjörn Örtengren from Ericsson Radio AB for providing us with the telecom application. My colleague Călin Curescu was the one who prepared the application for use in our experiments. Thank you!

I also want to thank Lillemor Wallgren and Britt-Inger Karlsson for helping me with the administration around the thesis defense.

This work has been supported by the European Commission IST initiative, Project TRANSORG, which is included in the cluster of projects EUTIST-AMI on Agents and Middleware Technologies applied in real industrial environments, by the FP6 IST project DeDiSys on Dependable Distributed Systems, and by CENIIT (Center for Industrial Information Technology) at Linköping University.

I am endlessly grateful to my wonderful friends who, nearby or far away, always think about me and await my coming home. I would not be complete without them.

I dedicate this work to my mother and late father, and to my one and only Saşa.

Diana Szentiványi

Contents

1 Introduction
  1.1 Motivation and overview
  1.2 Problem description
  1.3 Contributions
  1.4 Publications
  1.5 Thesis outline

2 Terminology
  2.1 Faults and failure models
    2.1.1 The notions of fault, error, and failure
    2.1.2 Failure models
    2.1.3 Timing models
    2.1.4 Unreliable failure detectors
  2.2 Achieving fault tolerance
    2.2.1 Software fault tolerance
    2.2.2 Replication strategies
    2.2.3 State and checkpointing
  2.3 Communication in fault-tolerant distributed systems
    2.3.1 Broadcasts
    2.3.2 Message ordering
  2.4 Consensus as a basic primitive
  2.5 Queuing theory
    2.5.1 Random variables
    2.5.2 Queues and related notions

3 Fault tolerance and middleware
  3.1 Basic algorithms for consensus
    3.1.1 Consensus in synchronous systems
    3.1.2 Consensus in asynchronous systems
  3.2 Variants of unreliable failure detectors
  3.3 Perfect failure detectors
  3.4 The group abstraction
    3.4.1 Group services
    3.4.2 Specification and implementation of group services
    3.4.3 Fault-tolerant group based platforms
  3.5 Fault tolerance and commercial middleware
  3.6 Assessing performance
  3.7 Assessing availability

4 Two fault-tolerant CORBA implementations
  4.1 The FT-CORBA infrastructure
    4.1.1 The standard
    4.1.2 Failure model and assumptions
    4.1.3 Infrastructure building blocks
    4.1.4 Logging and failover mechanism
  4.2 The FA-CORBA infrastructure
    4.2.1 Architecture units
    4.2.2 Infrastructure interactions
    4.2.3 FA-CORBA implementation
  4.3 Related work

5 Empirical studies with a telecom application
  5.1 Experiments with the telecom application
    5.1.1 The telecom application
    5.1.2 Experiment setup
    5.1.3 Measuring roundtrip time overheads
  5.2 Results
    5.2.1 Overheads
    5.2.2 Failover times
  5.3 Lessons learnt
  5.4 Reflections on the FT-CORBA infrastructure

6 Improving performance of fault-tolerant software
  6.1 Motivation
  6.2 Basic AOP concepts
  6.3 Aspects for fault tolerance mechanisms
  6.4 Implementation issues
    6.4.1 Aspects defined percflow
    6.4.2 Method execution related advices
    6.4.3 Field level synchronization advices
  6.5 Evaluation of the approach
  6.6 Discussion
  6.7 AOP and middleware
  6.8 Summary

7 The optimal checkpointing interval
  7.1 Related work
  7.2 The checkpointing procedure
  7.3 Basic model
  7.4 Modelling assumptions
  7.5 Queuing analysis
  7.6 Optimal checkpointing for maximum availability
  7.7 Optimal checkpointing for minimum response time
    7.7.1 New assumptions
    7.7.2 Minimizing average response time
  7.8 Summary

8 Analysis of the optimization models
  8.1 Numerical studies
    8.1.1 Relating response time to request arrival rate
    8.1.2 Relating availability to request arrival rate
    8.1.3 Relating checkpoint arrival rate to request arrival rate
    8.1.4 Relating checkpointing interval to request arrival rate
    8.1.5 Availability maximization and failure rate
    8.1.6 Response time minimization and failure rate
    8.1.7 Average availability-average response time trade-offs
  8.2 Simulations
    8.2.1 Validation approach
    8.2.2 Validation results

9 Conclusions and future work
  9.1 Concluding remarks
  9.2 Future work

List of Figures

3.1 State transition diagram example

4.1 Deployment of the FT-CORBA infrastructure
4.2 Illustration of the double role of the active replication gateway
4.3 Checkpointing and logging mechanism
4.4 Object state portions
4.5 FA-CORBA Server and Client infrastructure units
4.6 Client-Server interactions in the FA-CORBA infrastructure
4.7 Client-Server message sequence chart when no failures occur
4.8 Client-Server message sequence chart under failure suspicions

5.1 Roundtrip time overheads illustration
5.2 Bird's eye view of minimum and maximum overhead percentages
5.3 Time spent in consensus for group sizes of 2, 3, 4, 8 replicas
5.4 Average method replay and average update application times (in ms)
5.5 Variation with group size of the average time to apply an update

6.1 Two method calls that access non-independent parts of the state
6.2 Aspect weaving
6.3 Advices, join points, pointcuts
6.4 "Introduction" in AspectJ
6.5 Client-Server interactions in a FT-CORBA primary-backup setting
6.6 Client-Server interactions in an improved primary-backup setting
6.7 Gray-box description of the server application object
6.8 Example of an aspect defined percflow
6.9 An "around" advice
6.10 Performance improvements in a cold-passive setting
6.11 Performance improvements in a warm-passive setting

6.12 Changed method a
6.13 Field modification impossible to capture by pointcuts

7.1 The logging and application servers and their relation
7.2 The checkpointing procedure time slices
7.3 Servers and queues
7.4 Relation between different random variables
7.5 Equilibrium, failover and backlog processing
7.6 Failures and checkpointing

8.1 Minimal average response time vs. load
8.2 Load sensitivity of the average response time when system configured for minimal response time
8.3 Minimal average response time and with no checkpointing vs. load
8.4 Maximal average availability vs. load
8.5 Optimizing average checkpoint rate vs. load
8.6 Average checkpoint rate minimizing response time and backlog vs. load
8.7 Maximizing checkpointing interval and duration vs. load
8.8 Maximizing checkpointing interval and duration vs. load
8.9 Maximizing average checkpoint rate vs. failure rate
8.10 Average number of calls to replay and maximum average availability vs. failure rate
8.11 Minimizing average checkpoint rate vs. failure rate
8.12 Minimal average response time vs. failure rate
8.13 Minimizing and maximizing average checkpoint rate vs. load
8.14 Average availabilities vs. load for high failure rates
8.15 Average availabilities vs. load for low failure rates
8.16 Average response times vs. load
8.17 Average availabilities from mathematical and simulation model vs. load
8.18 Average availability for different average checkpoint rates vs. load
8.19 Average availability from simulation model for different checkpointing intervals vs. load
8.20 Average availability for different checkpointing intervals vs. load

Chapter 1

Introduction

The trend in today's engineering and application development is to reuse pieces of software [138]. A lot of effort is made to devise easy ways of developing complex, distributed, network based applications with reusable components. Since the late 80s, when the term middleware was first used in the context of distributed systems, the idea of interposing a software layer between the hardware and operating system on one side and the application on the other has evolved, so that in the future programming will not be needed to build applications, but only to integrate and configure reusable components. Thus, the same middleware can be used to develop several different applications. The maintainability of the system also improves this way. Here maintainability is used with its meaning from software engineering: the ease with which a software system or component can be modified to correct faults, improve performance or other attributes, or adapt to a changed environment. Bakken [11], in his article with the same title, defines middleware as "a class of software technologies designed to help manage the complexity and heterogeneity inherent in distributed systems". Middleware is thus a piece of complex software that can encapsulate interoperability features, as well as means to ease the development of distributed applications. Several categories of middleware exist: middleware that hides communication details ("distributed object middleware", as Bakken calls it), middleware that allows applications to access a database system, and middleware that provides messaging and other services, e.g. agent platforms. Middleware for context-aware computing, wireless networks, and ad-hoc networks [7, 120] is becoming more and more popular, as witnessed by the first workshop on Middleware for Pervasive and Ad-Hoc Computing organized in 2003. Another trend in the usage of middlewares is to make them adaptive, i.e. able to change their own configuration during runtime. One way of doing this is by using reflection [84, 32]. Reflection is achieved by making the middleware itself component-based, and thus built of reusable software. The keynote speech at the 2003 Middleware conference [54] praised Aspect-Oriented Programming [80] as a means to reconfigure middlewares themselves, adapting them to build different adaptable applications [53, 163].

Built on top of a middleware or not, an important aspect of an application is its capacity to provide service to users even in the presence of a failure. Thus, what application developers should strive for, besides respecting functional requirements in their system, as well as time-to-market requirements, is to provide a highly dependable system. Dependability is a concept with a large number of attributes; these are themselves the titles of different research fields. A short definition of this notion is as follows: dependability is the property of a computer system such that reliance can justifiably be placed on the service it delivers [46]. An alternative, complementary definition of dependability was introduced in 2001 by Avizienis et al. [5]. It involves the levels of non-trusted service acceptable to a user: dependability is the ability of a system to avoid failures that are more severe or more frequent, and outage durations that are longer, than is acceptable to a user. Attributes of dependability are: reliability, availability, safety, integrity, confidentiality, and maintainability. Reliability describes the property of a system to continuously provide correct service [5].
Safety is the property of a system that assures the absence of catastrophic consequences on the environment [5]. Maintainability is the property of a system expressing its ability to undergo repairs and modifications [46, 5]. According to Verissimo [158], maintainability is in fact measured by the time needed by a system to recover from a failure. Availability is the attribute of dependability that describes the readiness for correct service of the system [5]. This thesis will mainly focus on this aspect of dependability.

The means for achieving dependability are related to dealing with the faults in the system: fault prevention, fault tolerance, fault removal, and fault forecasting. In particular, fault tolerance increases the capacity of a system to continue providing its services at the same quality, or at some reduced, acceptable quality, even in the presence of faults. According to Fischer [51], a fault-tolerant system is one that operates in the presence of all anticipated faults, i.e. those grouped in the fault model of the system. To implement fault tolerance, besides error detection mechanisms, some redundancy of resources is needed. A frequently used mechanism is to replicate processing resources.

The importance of having standardized platforms to build available applications is emphasized by the existence of The Service Availability Forum. The Forum is a coalition of the world's premier communications and computing companies working together to create and promote open, standard interface specifications [55]. Since 2002 this organization has published, among others, the "Service Availability Forum Platform Interface" and the "Service Availability Forum Application Interface Specification".

The cost of fault tolerance in terms of extra resources like (processing) time, processing power and hardware is an interesting issue that needs attention, and thus is worth studying. Also, it is appealing to strive to offer application writers possibilities to develop fault-tolerant distributed applications with low effort. This gives a promising direction for using middleware for purposes other than interoperability. Middleware could encapsulate support for different dependability attributes such as security aspects or availability and fault tolerance aspects.

The work in this thesis answers questions that the application developer might ask when being provided with a middleware that hides partially or totally the fault tolerance handling details from the application. When we say totally hides, we mean that all mechanisms for handling logging, failure detection and failover are encapsulated in the middleware code. Partially hides, on the other hand, means that, in order to use application knowledge in a better way, some mechanisms are moved to the application level. However, this still happens with no or very little application developer intervention. The context in which the results of our studies are provided consists of soft real-time applications. This means that it is desirable that the roundtrip time for servicing a client request is within given limits. However, no catastrophe will happen in case the client occasionally does not receive the answer within the given interval. The request is simply resent in case of delays. The studies convey information about average response times, and if those do not exceed a certain limit in a replicated scenario then we deem the studied configuration a possible choice for deploying one's application. The work in this thesis is therefore different, for example, from that presented by Pinho et al. [119], which studies a framework for building reliable (hard) real-time distributed systems. In such a framework there is a separation between the soft and the hard real-time parts of an application, and higher reliability is ensured for the hard real-time parts.

1.1 Motivation and overview

Some researchers see the requirement for fault tolerance as part of the specification of the system, i.e. derived from availability, safety, integrity, etc. [97, 9]. Thus, a system needs to be fault-tolerant because it must be available, safe and able to defend against alteration of data by intruders. Others see the notion of fault tolerance as related to distribution. On the one hand, distribution itself is seen as a motivation for fault tolerance: some server (on some machine in the network) providing services to clients has to be available whenever the clients (possibly on a different machine) ask for its service. On the other hand, distribution is a way of achieving fault tolerance: running services on several machines is better than running all services on one single machine that is failure prone [125]. However, by and large, one can say that early works on fault tolerance either did not have distribution as a basic premise or had strong synchrony requirements on a system's components. The distinguishing factor between various works is often the emphasis on hardware or software fault tolerance [130, 100, 143, 35, 12], and whether the fault tolerance technique is employed in a particular setting, e.g. [14], or for disk storage [67].

There is a need for standardizing the way of building systems that combine fault tolerance features with other, e.g. performance, properties [161]. By using a middleware, distributed applications can be created such that different computers run code perhaps written in different programming languages, on possibly different operating system platforms. If assuring fault tolerance is not among the roles of the "gluing" instrument, then application writers have to incorporate it in each application that requires it. Thus, from an application development point of view, this is a waste of resources. Even if the fault tolerance mechanisms in different applications are not exactly the same, they contain several common components that bring in the potential for reuse and automatic code generation. Furthermore, with support in middleware, the size of application code can be reduced, thereby offering better maintainability of the system. To make informed decisions on how much support for fault tolerance to include in an existing middleware, or what mechanisms this support should employ, we need to thoroughly analyze these questions, both in an experimental and in a theoretical setting. This thesis makes major contributions towards these goals.

There are some drawbacks in incorporating the mechanisms dealing with logging, failure detection and failover totally in the middleware. This way, application knowledge is not exploited at all or is very difficult to exploit. Motivated by this background, in our work we study ways of lifting mechanisms from middleware to application level, but with very little application developer intervention. An elegant way of including non-functional portions in an application with little application developer contribution is aspect-oriented programming (AOP). By using application knowledge, even in a restricted way, performance might be improved. A similar case for improving maintainability by using software-implemented fault tolerance is presented by Deconinck et al. [40]. The framework presented includes a library with fault tolerance functions, middleware to coordinate their usage, as well as a language to express non-functional services.
Our industrial partner Ericsson Radio Systems AB develops distributed telecom applications that have to be, among other things, fault-tolerant and easy to develop. Therefore, a platform that encapsulates fault tolerance aspects is appealing. Besides, there is little performance measurement data in the literature involving realistic applications. These aspects motivated us to build experimental platforms and evaluate them by empirical approaches using a telecom application provided by Ericsson. Such measurements give valuable data on the merits of such platforms. However, empirical studies do not offer a very clear picture of how different parameters in the system influence each other. This motivated us to employ mathematical modelling to better understand such complex relations.

Returning to the specific platform implementations, a point also taken into account when choosing the type of middleware is to ensure interoperability features for applications built for different operating systems and language platforms. The CORBA (Common Object Request Broker Architecture) standard devised by the OMG (Object Management Group) is widely accepted and successfully implementable, therefore its extensions such as FT-CORBA (Fault-Tolerant CORBA) or RT-CORBA (Real-Time CORBA) can be seen as starting points for creating a unified framework for future multi-feature applications. Therefore, we were strongly motivated to consider CORBA as the middleware to work with in our empirical studies.

1.2 Problem description

By hiding different aspects of an application, specifically the handling of fault tolerance, in the middleware, a developer could become reluctant to use such a platform. This is especially the case when a developer who has become familiar with a middleware suddenly finds it augmented with a totally new and perhaps complex feature. The questions that arise concern performance as well as security, to mention just a few. In this thesis we will mainly focus on performance questions:

How much longer must the client wait for an answer from a server running on top of a middleware with fault tolerance features incorporated, compared to talking to a non-fault-tolerant server? In other words, what are the overheads while the server is up and running?

How large is the outage time of a server after it fails, when the failure is handled automatically by middleware mechanisms?

What parameters of the middleware and the application (e.g. replication style, number of replicas) influence overheads and outage times? How should the application developer configure the parameters of the middleware?

How can performance penalties be minimized by better using application knowledge?

To what extent can mathematical modelling help in devising parameter configurations that offer maximum availability or minimum request response time to a service user? How restrictive does a model have to be in order to allow reasonable computations leading to detailed enough results?

To sum up, this thesis discusses trade-offs between sacrificing performance (in terms of timing) and achieving high availability. It reveals how restrictive the failure model must be in order to allow a reasonably sized fault-tolerant infrastructure to be built. To what extent the implementation of a middleware extended with fault tolerance features (FT infrastructure) can influence the resulting trade-offs is illustrated on two implementations.

1.3 Contributions

Real-life industrial applications such as telecommunication systems are naturally distributed. Due to the complexity of design and implementation tasks, the object paradigm is a good choice. CORBA is a very popular middleware that allows objects to be distributed on different platforms and to be written in different programming languages. In this context, the contributions of this thesis work are as follows:

We describe a detailed implementation of two platforms based on CORBA with fault tolerance capabilities: one built by following the FT-CORBA standard, and a second obtained by implementing an algorithm that provides full availability; both platforms are available as open source [128].

We compare, by empirical studies, the two implementations of the CORBA based middleware that incorporates fault tolerance mechanisms:

– setting up a realistic (telecommunication) application on top of both extended middleware infrastructures; evaluating and comparing the performance of the application on top of both middlewares;
– trying to find out the influence of different parameters on the timing values;
– comparing and contrasting the benefits and drawbacks of each infrastructure.

We propose a technique to improve performance in the FT-CORBA based middleware by exploiting application knowledge; to enrich application code with fault tolerance mechanisms we use Aspect-oriented Programming.

We study, by mathematical modelling using queuing theory, the relation between different parameters of a middleware that resembles the FT-CORBA implementation. We focus on determining the checkpointing interval that maximizes availability or minimizes response time, respectively. To our knowledge, this is the first time a detailed mathematical model is used for such optimization purposes.

For the last two items in the above list, we claim that the results generalize directly to middlewares other than CORBA. As far as the first two items are concerned, we can make the following remark. The main reason for building the FT-CORBA platform was to analyze the feasibility of implementing the FT-CORBA standard and to analyze its performance. At the time the work was performed, no standard-compliant FT ORB was readily available for further performance study. However, even in this case we claim that, with low effort, the methodology used and the obtained results can be generalized as well.

1.4 Publications

This thesis is an extended version of a book chapter and research papers published in international conferences. These are:

Diana Szentiványi and Simin Nadjm-Tehrani: Building and Evaluating a Fault-Tolerant CORBA Infrastructure, in Proceedings of the Workshop on Dependable Middleware-Based Systems (WDMS'02), part of the International Conference on Dependable Systems and Networks (DSN'02), June 23-26, 2002, p. G-31–G-38.

Diana Szentiványi, Isabelle Ravot, Simin Nadjm-Tehrani, and Rachid Guerraoui: Dependable Distributed Middleware: Pay Now or Pay Later!, Poster Session at Middleware 2003, in "Middleware 2003 Companion", June 16-20, 2003, p. 331.

Diana Szentiványi and Simin Nadjm-Tehrani: Middleware Support for Fault Tolerance, chapter 18 in the book "Middleware for Communications", edited by Qusay H. Mahmoud, Wiley & Sons, July 2004.

Diana Szentiványi and Simin Nadjm-Tehrani: Aspects for Improvement of Performance in Fault-Tolerant Software, in Proceedings of the 10th IEEE Pacific Rim Dependable Computing Conference, 3-5 March, 2004, p. 283-291.

Diana Szentiványi, Simin Nadjm-Tehrani and John M. Noble: Optimal Choice of Checkpointing Interval for High Availability. Submitted for publication.

Diana Szentiványi, Simin Nadjm-Tehrani and John M. Noble: Minimizing Response Time by Efficient Checkpointing. Submitted for publication.

1.5 Thesis outline

The thesis is divided into nine chapters, as follows:

Chapter 2 presents basic terminology. It introduces basic notions such as asynchrony, failure detectors, consensus, reliable broadcast, and queuing theory. These notions form the theoretical basis for the later sections of the document.

Chapter 3 presents research results in the areas of fault tolerance, availability, and reliability, aiming to give an overview of the research problems that arise beyond the core notions presented in Chapter 2.

Chapter 4 describes implementations of two middleware platforms incorporating fault tolerance handling features: one that closely follows the FT-CORBA specification, and one that departs from that standard but offers more robustness by incorporating unreliable failure detectors.

Chapter 5 describes experimental results reflecting availability-performance trade-offs, obtained when evaluating the above infrastructures with a realistic application running on top of them. Pros and cons of both infrastructures are described here.

Chapter 6 presents our approach to lift some mechanisms from the middleware level to be incorporated at application level, by using the aspect-oriented programming technique.

Chapter 7 presents the mathematical models used to obtain the optimal checkpointing interval when considering two optimization criteria: the average availability of the system and the average response time.

Chapter 8 presents studies of parameter relations in the above optimization context, including availability-performance (response time) trade-offs. Simulation results are also presented to validate the mathematical model built to obtain maximal availability.

Chapter 9 concludes the thesis and indicates possible future directions.

Reading guidelines

To help the reader easily find the parts of interest, we provide a brief guide here. If you are not familiar with basic fault tolerance terminology such as fault, replication, consensus, and broadcast, needed for later understanding of the work in the thesis, read Chapter 2. If you are interested in an overview of research results mainly on failure detectors, consensus, group services, and performance assessment, and would like to see the merits of the contributions in the larger research context, proceed to Chapter 3. However, you can skip it if you are eager to find out about the details of our contributions.

If you are interested in the technical details of implementations of CORBA middleware incorporating fault tolerance mechanisms, read the whole of Chapter 4. If you are only interested in the details of an implementation closely following the FT-CORBA standard, read Section 4.1, and proceed to the following chapters. To understand the reasons behind the improvement presented in Chapter 6 you must read Section 4.1.4, but not more.

If you are only interested in the mathematical modelling for obtaining the optimal checkpointing interval, skip chapters 4 and 6 and read Chapter 7. If you are not interested in the mathematical computations that lead to the formulas of the quantities to optimize, you need from Chapter 7 only the tables (7.1, 7.2 and 7.3) and the formulas of average availability and average response time, i.e. equations (5) and (11); then you can proceed and read Chapter 8.

Chapter 2

Terminology

This chapter presents basic notions necessary to make the thesis self-contained. The notions that will be introduced range from synchrony, asynchrony and replication to consensus, group communication and notions of queuing theory.

Basic notions related to systems that are assumed to fail and still proceed in their actions were introduced in the early 1970s. In the early years there was little unifying work to formalize these notions. This unfortunately resulted in authors not systematically building on earlier agreed common grounds for a while. We see, for example, that basic notions such as crash and failstop are used in one way by some authors [144, 143], and in the opposite way by others [51, 140]. An early attempt at defining commonly agreed upon terms appeared in the work by Laprie [89], which was later adopted and developed as the working document for the IFIP WG 10.4 (International Federation for Information Processing Working Group 10.4) on dependable computing. In 2001 a new document about dependability concepts [5] appeared, bringing in a few changes and additions to the basic notions. One of the early attempts at modelling systems for achieving fault tolerance appeared in the works of Cristian [35]. However, it was not until the late 1990s that a general framework for formal treatment of fault tolerance appeared [62].

2.1 Faults and failure models

In order for a system to be able to provide service in the presence of something "going wrong", it must be possible to make the system "go right". For this, there is a need for understanding what failures the system can be exposed to, for detecting them, and for continuing to provide services despite the possibility of failures.

2.1.1 The notions of fault, error, and failure

Faults, errors and failures are the threats to a system providing its expected service to its users. Before a failure occurs, there is a chain of events. First, a fault (i.e. a defect at a low level) has to exist or appear in one part of the system. This fault can lead to an error if that part of the system is activated under certain conditions. An error is, thus, the manifestation of a fault. We can say that if an error happens the system is in an erroneous state. A fault that does not cause an error, i.e. that is not active, is called dormant [5]. If the error is not "treated", it can lead to a failure. The treatment can be repairing a broken part, switching to a redundant unit, or jumping to an exception handler in a piece of software. According to Avizienis et al., "a failure is an event that occurs when the delivered service deviates from correct service. ... A failure is a transition from correct service to incorrect service." [5]. The chain ... - fault - error - failure - ... can be quite long. A fault at a higher level of a system can, in its turn, be caused by a failure at a lower level. Classifications of faults and failures are given in the literature [46, 5]. Two examples are given below. Considering their occurrence in time, faults can be:

transient - occurs at a certain moment, lasts for a time and then “goes away”, e.g. a radiation beam hitting a circuit but not affecting memory

intermittent - occurs in the system from time to time and repeatedly “goes away”, e.g. a system overload fault

permanent - a fault that is always present in the system; the question is when it will cause an error. For example, a design fault is a permanent fault.

Considering the intent of their occurrence, faults can be classified as:

accidental - a user presses a wrong key when using a system, as he/she does not know how to use the system, and the system is not robust enough

deliberate, non-malicious - usually related to intrusion in a system for reading data; a bank employee can penetrate the security system of a bank's computer system and read account data (although not authorized to do so); however, this does not affect the well-functioning of the bank's computer system

deliberate, malicious - if, for example, a person, besides reading the data from the bank's databases, also manages to tamper with the code of the software handling the accounts in a malicious way

The next section will describe how a system is modelled by the failures it can be exposed to.

2.1.2 Failure models

A failure model shows the way a system can fail as perceived from other systems using its services. A similar term expressing the same meaning is failure semantics [36]. The failure model of a system gives information about which parts of the system can fail in certain ways and, thus, indicates which failures can be recovered from in the system. The class of failures covered by a particular algorithm is called the algorithm's failure model. This means that the algorithm works correctly as long as only failures indicated in the model occur. For example, a distributed algorithm is designed such that even if one (or maybe several) participant nodes fail in a certain way (given in the model), the algorithm will perform correctly, delivering the correct result.

Node failure models

The main constituents of a distributed system that might fail are the processes (nodes) and the network connections (links). In the following, when we talk about node or process failures we mean failures of the units on which computations are performed, as distinguished from the communication medium, i.e. the network. As defined by Verissimo and Rodrigues [158], the arbitrary failure model is the most generic one involving nodes. Thus, when choosing such a model, one is on the safe side, but at a high price. More restrictive failure models are used to obtain reasonable designs. In the arbitrary failure model category we also have the Byzantine failure model (Byzantine and arbitrary are in some cases interchangeable). Here the unintended behaviour of a unit can potentially be caused by malicious attacks. Verissimo and Rodrigues [158] describe Byzantine failures as occurring when a node commits a semantic failure, i.e. it diffuses a semantically incorrect result to other nodes in an inconsistent way. In other words, the node sends different (incorrect) messages to other nodes. At the other extreme, the most common failure model involving nodes is the crash failure. According to Fischer [51] and Schneider [140], to fail by crashing means that the unit stops executing its computation and communication, and does nothing further. This type of failure is not detectable by other nodes: they cannot decide whether the node that stopped communicating with them has failed or is only too slow. Other failure models delimited by Schneider [140] are the failstop failure model (similar to crash, but the failure can be detected), and the receive and send omission failure models (a processor fails to receive, respectively send, some of its messages).

Network failure models

When considering the communication medium, failures can be due to a broken communication wire, or to congestion of a communication link. In both cases, the two units connected by the failed link cannot communicate. Two types of failure models can be considered. One is the partition failure model. In this case, we say that the two nodes are in different network partitions, unless there is some other indirect way for them to communicate. A partition also means that it is possible for the system to split into two (or more) disjoint sets of processes working separately, with each set believing that there are no other working processes in the system. On the opposite side we have the no-partition assumption, as defined by Ricciardi et al., which means the reverse: there is no situation in which there are at least two disjoint sets of processes in the system such that the members of these sets believe that they are the only alive nodes [136]. As a consequence, if there always exists at least one correct process that is not suspected to have failed by other correct processes, then the no-partition assumption holds. A correct process is one that has not failed, i.e. it is up and running. Note that it is possible for a correct process to be (wrongly) suspected to have failed. Similar to the no-partition assumption is the primary-partition notion. This means that there will always be a majority of correct processes, and progress will be allowed only if this majority can participate in decision making. The next section will present an important aspect that relates to the possibility of detecting faults: the way the distributed nodes perceive the passing of time.

2.1.3 Timing models

If we think about how the passing of time is perceived by the nodes in a distributed system, we distinguish two basic models: the synchronous system model and the asynchronous system model.

In a synchronous system model one makes assumptions about processing speeds, i.e. for each pair of system components it is known how many computation steps one component performs while the other performs one step. Also, communication delays are bounded and known. Therefore, in such a model it is easy to detect that a node (process) has failed, by noting that, e.g., its heartbeat message did not arrive within the known time interval (communication delay).

An asynchronous system model, classified by Schneider [140] as the "no-assumption model", is a generic one, since all systems can be seen as asynchronous. In such a model, assumptions are made neither about relative processing speeds of nodes, nor about communication delays. Also, as presented by Verissimo and Rodrigues [158], such systems lack knowledge about the drift of local clocks and their relation to other clocks' indications. There has been a large body of work on clock synchronization algorithms over the years (for historical snapshots see [144, 157]). However, in an asynchronous system even perfectly synchronized clocks cannot help, as there is no knowledge about how long one has to wait before labeling a (maybe too slow) node as failed. The asynchronous model is the most appealing one to use when aiming to build realistic fault-tolerant systems.

Milder variants of asynchronous system models (the above mentioned one is time-free [158]) are the timed asynchronous model [37], the quasi-synchronous model [4], or simply partially synchronous models. A partially synchronous system model is in fact derived from real-life observations: there are intervals during the system's lifetime when there exist bounds on message delays and on process execution times, i.e. the system is not permanently asynchronous or synchronous. A notion introduced by Dwork et al. is partial stability, in which the bounds are not known [45]. Actually, Dwork et al. delimit very precisely two partial synchrony models: one where the bounds on message delays and/or process synchrony hold only after a global stabilization time, and one where the same bounds exist and hold all the time, but are not known.

Now, considering that the (time-free) asynchronous model is most appropriate for building realistic fault-tolerant systems, the "FLP impossibility result" of Fischer, Lynch, and Paterson [52] is indeed revolutionary. They prove that in a distributed system where the time-free asynchronous timing model is used, even with only a single node failing, it is impossible for all nodes to agree on a common outcome in bounded time. And this, intuitively, happens only because there is no way to decide whether the non-responsive node is merely too slow, or has really failed.

To circumvent the impossibility result, in 1996 Chandra and Toueg [25] formalized the notion of unreliable failure detectors. This notion involves further assumptions, but not ones directly related to the timing model (see next section).

2.1.4 Unreliable failure detectors

Any system that treats faults in a foreseeable manner has to be able to recognize the symptoms of faults leading to failures at subsystem level. Therefore, fault detection (or recognition) is an important contributing factor in achieving fault tolerance. If the time-free asynchronous system model is used, we saw that in the presence of failures there is no possibility of guaranteeing that agreement is reached in bounded time. However, it is possible to augment the distributed system with special failure detector modules. Thus, one does not depart from a realistic setting of a distributed system, i.e. the time-free asynchronous timing model. Instead, the decision about a node's failure (based on delayed message arrival) is taken, possibly erroneously, and further used as such. However, how harmful can this be, as long as one is conscious that this decision can be wrong, and acts to reduce the consequences of a mistake? This aspect is explored by Chandra and Toueg in their formal characterization of failure detectors [25]. They actually play with "mistake patterns" that enable them to specify the so-called unreliable failure detectors.

For example, when executing some distributed algorithm, processes do not explicitly reason about time – time is simply abstracted away from the algorithm execution level. Instead, they query their failure detectors about the availability of the other nodes. The failure detectors can make mistakes, by suspecting processes that are correct to be faulty – the so-called lack of accuracy, or by not detecting an actual failure – the so-called lack of completeness. Even though strong accuracy, i.e. the property that a correct process is never suspected by any correct process, is difficult to achieve, some weak accuracy or eventual accuracy properties can be satisfied. The same is true about completeness (eventually all process failures are detected). Strong completeness (eventually all correct processes will permanently suspect all failed processes) is difficult to achieve in an asynchronous setting. However, Chandra and Toueg prove that for the correctness of some distributed algorithms it is enough that the failure detectors have some weak properties. The impossibility result by Fischer et al. can be circumvented if the failure detector properties hold for sufficiently long for the algorithm to complete. This revolutionary outcome of Chandra and Toueg's work actually made it possible to solve the agreement problem in distributed time-free asynchronous systems.
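To make the querying interaction concrete, the following is a minimal, hypothetical Java sketch; it is not part of the thesis platforms, and the interface name, the heartbeat scheme and the timeout value are invented for illustration. An algorithm asks a failure detector module which processes it currently suspects, and the naive heartbeat-based implementation shows why the answers can be wrong:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface: an algorithm asks "whom do you currently suspect?"
// instead of reasoning about time itself.
interface FailureDetector {
    Set<String> suspected();
}

// Naive heartbeat-based implementation: a process is suspected if no heartbeat
// has been seen within 'timeoutMillis'. In a time-free asynchronous system this
// guess can be wrong in both directions: a slow process may be suspected
// (lack of accuracy) and a crashed one may go unnoticed for a while
// (lack of completeness).
class HeartbeatFailureDetector implements FailureDetector {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the communication layer whenever a heartbeat from 'processId' arrives.
    void onHeartbeat(String processId) {
        lastHeartbeat.put(processId, System.currentTimeMillis());
    }

    @Override
    public Set<String> suspected() {
        long now = System.currentTimeMillis();
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey()); // possibly a mistake: the process may just be slow
            }
        }
        return result;
    }
}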

In particular, the weakest failure detector (called "eventually weak", ◇W) that makes it possible to solve consensus in time-free asynchronous systems, when a majority of processes are always up, has the properties of eventual weak accuracy (there is a time after which some correct process is never suspected by any correct process) and weak completeness (eventually, every crashed process is permanently suspected by some correct process) [24]. As one can intuitively guess, detecting failures in a distributed system is not enough to make it fault-tolerant. The next section will present the other ingredient that leads to a fault-tolerant system.

2.2 Achieving fault tolerance

As formally proved by Gärtner [62], there is "no fault tolerance without redundancy". Although the common usage of redundancy deviates from the one Gärtner used, his result is still convincing as theoretical support for our intuitions. There are two types of redundancy: in space and in time. Redundancy in space means that multiple copies of a process are added to achieve fault tolerance. The same is true for data; multiple copies of a data item are used to provide masking in case of failures. Another possibility is to augment data items with redundant information (for example parity bits and CRC codes in data transmission). When all copies of redundant information are enforced to be consistent, i.e. to have the same content, we talk about replication. In such a setting, majority voting can be used to enforce changing a data item consistently, or to create a unique reply to a query. In Gärtner's view, redundancy in time means that some action in the code of the process is executed only in case of failures. For example, a process can be divided into mandatory and optional parts where, upon a failure in the mandatory part, this part is reexecuted instead of the optional parts [8].

2.2.1 Software fault tolerance

A well known approach for writing replicated fault-tolerant software is design diversity. This is part of so-called software fault tolerance. There are two techniques employing design diversity: N-version programming [6] and recovery blocks [130].


With N-version programming, several versions of the same program run in parallel and the result of the computation is chosen by voting on the results of each version.

When employing recovery blocks, the outcome of a specific algorithm to be executed is checked against an application-related acceptance test. A new recovery block is chosen and a new version of the algorithm is executed, until such a test is passed.

The first item implicitly mentions how the redundant items are deployed: "... versions of the same program run in parallel ...". The next section explains the possible ways of deploying and using redundant units in a fault-tolerant server setting; before that, the sketch below illustrates the recovery-block pattern.
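The sketch is hypothetical Java, not taken from the thesis or from [130]; the RecoveryBlock class, its alternates and the acceptance test are invented for illustration, and the state restoration that a full recovery-block scheme performs between alternates is omitted.

import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch of the recovery-block pattern: alternate versions of an
// algorithm are tried in order until one produces a result that passes the
// application-related acceptance test.
final class RecoveryBlock<T> {
    private final List<Supplier<T>> alternates;   // primary version first, then the alternates
    private final Predicate<T> acceptanceTest;    // application-related acceptance test

    RecoveryBlock(List<Supplier<T>> alternates, Predicate<T> acceptanceTest) {
        this.alternates = alternates;
        this.acceptanceTest = acceptanceTest;
    }

    // Runs the versions one by one; the first accepted result is returned.
    T execute() {
        for (Supplier<T> version : alternates) {
            try {
                T result = version.get();
                if (acceptanceTest.test(result)) {
                    return result;                 // the acceptance test passed
                }
            } catch (RuntimeException e) {
                // a version that throws is treated as having failed its acceptance test
            }
        }
        throw new IllegalStateException("all versions failed the acceptance test");
    }
}

As a usage example, a sorting routine could supply a fast primary version and a simpler backup version, with an acceptance test that merely checks that the output is sorted.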

2.2.2 Replication strategies

If the fault-tolerant system is a server that receives queries from clients, it is important to distinguish different replication styles. These determine which replica(s) resolve the client queries, as well as how the replicated server is maintained in order to offer the client the image of a single unit. Four replication styles are mentioned in the literature.

Primary-backup: in this approach, there is only one replica that processes requests from clients and answers them (the primary). The backups just receive state update information (in some forms of primary-backup) from the primary. The reason for this is that, in case of a primary crash, one backup should be able to take over the job of the failed replica in a predetermined manner [69]. See Figure 2.1(a) for a graphical illustration of the concept, and the code sketch after this list for a minimal illustration.

Active replication: in this approach, all replicas process requests from the client in an active way. The restriction here is that all replicas perform deterministic computations, in a state machine fashion [139]. If the crash failure model is considered, it is enough to send one of the answers, from the fastest replica. If the Byzantine failure model is considered, failures of members are masked by using a voting mechanism on the results. In the former case, one replica more than the number of failures tolerated suffices [139].

In the latter case, when a synchronous system is considered, the number of replicas has to be 3k + 1 if the failure of k replicas has to be masked [86].

[Figure 2.1: (a) Primary-backup replication; (b) Active replication. Client C, primary P, backups B, active replicas A.]

Semi-active strategy: in this approach, there is a central replica that does all the computation steps and that leads all the other replicas. In case of a non-deterministic choice, the replicas wait for the leader's decision, and they follow it [41].

Semi-passive strategy: in this (semi-primary-backup) replication style, the primary is elected by a consensus algorithm [125].
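As a rough illustration of the primary-backup style of Figure 2.1(a), here is a minimal, hypothetical Java sketch; the class names and the key-value state are invented for the example and are not part of the thesis platforms. Only the primary executes client requests, while backups merely apply the state updates it propagates.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A replica holding a simple key-value state. Backups never execute client
// requests; they only apply ready-made state updates sent by the primary.
class KeyValueReplica {
    protected final Map<String, String> state = new HashMap<>();

    void applyUpdate(String key, String value) {
        state.put(key, value);
    }
}

// The primary alone processes client requests; after executing a request it
// propagates the resulting state update to every backup, so that one of them
// can take over in a predetermined manner if the primary crashes.
class PrimaryReplica extends KeyValueReplica {
    private final List<KeyValueReplica> backups = new ArrayList<>();

    void addBackup(KeyValueReplica backup) {
        backups.add(backup);
    }

    String handleWrite(String key, String value) {
        state.put(key, value);              // execute the client request
        for (KeyValueReplica backup : backups) {
            backup.applyUpdate(key, value); // in reality sent over the network
        }
        return "OK";                        // reply to the client
    }
}

In active replication, by contrast, every replica would execute handleWrite itself, which is why the computation must be deterministic.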

2.2.3 State and checkpointing

The state (as used in "state update" or "state machine", see Section 2.2.2) refers to the collection of data on which the processes act. The state might also include actions performed up to a certain point during a computation. Ideally, in all types of replication strategies, failures have to be handled such that the states of the replicas (of the group) remain consistent. In a group using primary-backup replication, if the primary fails, this has to be noted by the backups, which have to choose a new primary and continue providing services from where the primary left off. This can be achieved by remembering the operations performed by the primary while it was up and, as soon as it fails, replaying those operations on one of the backups. To avoid situations in which a large number of operations are reexecuted on the backup when it takes over the role of the primary, from time to time the state of the primary is read and recorded, preferably in stable storage. The operation of reading and storing the state is called checkpointing. By employing checkpointing, the only operations that have to be stored for replay are those executed by the primary since the last state recording. This improves storage requirements as well as backup take-over times. Here we also introduce the term checkpointing interval, which we define as the time interval from the end of one checkpointing operation, i.e. the moment of state storage, until the beginning of the next one. The procedure that follows the occurrence of a failure, when a backup takes over the role of the primary, is called failover. It consists of transferring the latest checkpointed state of the primary into the backup, followed by replaying, on the backup, the update requests that arrived since the last checkpoint was taken. The next section will describe primitives for communication between replicas. Before that, the sketch below illustrates checkpointing, logging and failover.
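This is a minimal, hypothetical Java sketch; the key-value state, the class names and the in-memory stand-in for stable storage are invented for the example and are not the thesis implementation. Updates are logged, the state is periodically checkpointed, and failover installs the last checkpoint and replays the logged updates.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for stable storage: holds the last checkpointed state and the log
// of update operations performed since that checkpoint.
class StableStorage {
    Map<String, String> checkpoint = new HashMap<>();
    final List<String[]> logSinceCheckpoint = new ArrayList<>();
}

class PrimaryServer {
    private final Map<String, String> state = new HashMap<>();
    private final StableStorage storage;

    PrimaryServer(StableStorage storage) {
        this.storage = storage;
    }

    // Every update is executed and logged so it can be replayed after a crash.
    void update(String key, String value) {
        state.put(key, value);
        storage.logSinceCheckpoint.add(new String[] { key, value });
    }

    // Checkpointing: record the full state and truncate the log.
    void checkpoint() {
        storage.checkpoint = new HashMap<>(state);
        storage.logSinceCheckpoint.clear();
    }
}

class BackupServer {
    final Map<String, String> state = new HashMap<>();

    // Failover: install the last checkpoint, then replay the logged updates in order.
    void failover(StableStorage storage) {
        state.clear();
        state.putAll(storage.checkpoint);
        for (String[] op : storage.logSinceCheckpoint) {
            state.put(op[0], op[1]);
        }
    }
}

With this structure, the cost of failover grows with the number of logged updates, which is what a shorter checkpointing interval reduces, at the price of more frequent state recordings while the primary is up.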

2.3 Communication in fault-tolerant distributed systems

Because in a replicated fault-tolerant service framework all processes have to be involved in the communication, the sending of messages is done via broadcasts [70]. When a message is sent to only a subset of the global destination set, we talk about multicast.

2.3.1 Broadcasts

In a failure-prone setting, even a broadcast activity can be affected by problems caused by failures. We therefore have to introduce the notion of reliable broadcast. Before that, we need to define the notion of message delivery and distinguish it from message reception:

Receiving a message means that the message arrived at the destination process. This is similar to a letter arriving in one's post box.

Delivering a message – usually "to an application" or "by a process" – means that the message is ready to be used by the application. Thus, message delivery in this context can be seen as analogous to taking out a letter from one's mailbox and reading it.

A reliable broadcast is a broadcast with the following three properties:

agreement: all correct processes deliver the same set of messages

integrity: a message is not delivered unless it is sent, and it is delivered only once

validity: a message sent by a correct process is eventually delivered by all correct processes

From the agreement property it follows that every message is either delivered by all correct processes or by none of them. The other two properties imply that the underlying message delivery system neither creates nor loses messages.

2.3.2 Message ordering

Besides being reliable, sometimes a broadcast primitive has to guarantee different types of message delivery order. Depending on the delivery order restrictions, we can distinguish different types of reliable broadcast primitives.

One possibility is to ask for a FIFO delivery order (thus, related to a sender): messages from sender A are delivered in first-in-first-out order. There is no restriction on messages from sender A as compared to messages from sender B [70, 158].

A second possibility is to ask for causal delivery order. In this case it is required that before delivering a message, all messages causally preceding it should be delivered first. For node B delivering m sent by node A, one way would be to first deliver all messages that were delivered by A before sending m, and then deliver m [15].

A third possibility is to request the same order for delivery of messages. This means that messages should be delivered at all the receiving nodes in a total order, also called atomic order. Of course, this atomic order property refers to the ordering of messages that are otherwise not restricted by other criteria (like causal or FIFO delivery ordering) [70, 158].

Communication between replicas is important especially when they need to perform distributed algorithms for agreement. Agreement algorithms are put under one name: consensus. This is the topic of the next section.
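As a small illustration of the weakest of these guarantees, the sketch below enforces FIFO delivery from a single sender by buffering messages that arrive ahead of their per-sender sequence number. It is a generic example, not part of any of the broadcast protocols cited above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** FIFO delivery for one sender: deliver message i+1 only after message i. */
class FifoDeliveryBuffer {
    private long nextExpected = 0;                         // sequence number to deliver next
    private final Map<Long, String> pending = new HashMap<>();

    /** Called on reception; returns the messages that may now be delivered, in order. */
    List<String> receive(long seqNo, String msg) {
        List<String> deliverable = new ArrayList<>();
        pending.put(seqNo, msg);
        while (pending.containsKey(nextExpected)) {        // deliver as long as the next
            deliverable.add(pending.remove(nextExpected)); // expected message is present
            nextExpected++;
        }
        return deliverable;
    }
}
```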

2.4 Consensus as a basic primitive

A consensus algorithm is usually defined in terms of the two notions propose and decide. Replicas typically need to agree on a joint value within a reasonable time. Besides termination of the consensus operation, the value decided upon has to be a valid one, in the sense that it has been proposed by one of the replicas. The consensus operation is invoked by the replicas at a certain moment when there is a need to get a unique result. Each process proposes its value. After a set of steps, the processes arrive at a point when they decide on the final value. The algorithm has to be constructed such that it guarantees that processes decide on the same value. Summing up the above, an algorithm that solves consensus has to satisfy the following three properties:

Agreement - no two (correct) processes decide differently

Validity - each (correct) process decides at most once and on a value that was previously proposed by one of the processes, or is related to those proposed values

Termination - every (correct) process eventually decides

The reason why consensus is considered as a basic primitive is that a number of problems in distributed systems are reducible to consensus, or solvable by using consensus. Examples of such problems are the atomic broadcast problem (ABP) [25], and the atomic commit problem [68].

Take, for example, the atomic broadcast problem: processes have to agree on the order in which they deliver a set of messages. When all processes have the set of messages in possibly different orders, they will start a consensus algorithm to choose the unique order of delivery. This is an intuitive way of showing that the ABP is reducible to consensus: if there is an algorithm that solves consensus then the ABP can also be solved. On the other hand, if there is a solution for the ABP, meaning that all processes in a message delivery step will deliver the same message, then we can say that processes agree on the value of that message and thus reach consensus.

The next section will present some terminology that is valuable for quantified measures of availability and the relationship between this system level property and other system parameters.
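The propose/decide terminology can be captured by a small interface. The sketch below is only an illustration of the notions introduced in this section; the names are assumptions, and the example merely shows how the atomic broadcast reduction would use such an interface.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Consensus seen from one participating process: propose a value, then decide. */
interface Consensus<V> {
    /**
     * Proposes this process's value and blocks until the group decides.
     * Agreement:   all correct processes obtain the same return value.
     * Validity:    the returned value was proposed by some process.
     * Termination: every correct process eventually returns.
     */
    V propose(V value) throws InterruptedException, TimeoutException;
}

/** Example use: agreeing on a total delivery order for a batch of messages (ABP). */
class TotalOrderExample {
    static List<String> orderBatch(Consensus<List<String>> consensus,
                                   List<String> localOrder)
            throws InterruptedException, TimeoutException {
        return consensus.propose(localOrder); // the decided list is the unique delivery order
    }
}
```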

2.5 Queuing theory

In a well-written book about queuing theory, such as the one by Allen [3], one will find at least the notions of Markov property, exponential distribution, server, customer, customer interarrival time distribution, M/M/1 queue, service time distribution and wait time in the queue. We will explain these in turn. First, let us introduce the notions of random variable, distribution, conditional probability and expectation (average value). Then, in the last part of this section, we will explain the queuing theory notions themselves.

2.5.1 Random variables

A simple definition of a random variable [65] is: a random variable is a function that associates a real number with the outcome of an experiment. This outcome can be a continuous quantity (e.g. the outside temperature, in case the experiment is the reading of a thermometer), or a discrete one (e.g. a number associated with the face of a coin). In the first case we talk about continuous random variables, and in the second case we talk about discrete random variables.

A random variable takes a certain value with a certain probability. For example, for the thermometer reading, the probability of the temperature being less than $x$ degrees is $P(X \le x)$. Thus, every number $x$ has an associated probability of the random variable $X$ being less than or equal to $x$. The function $F(x) = P(X \le x)$ is the probability distribution of the variable $X$. Note that $0 \le F(x) \le 1$ for all $x$. Let $f$ be the probability density function (abbreviated pdf), with the following properties:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \qquad \int_{-\infty}^{+\infty} f(t)\,dt = 1.$$

The probability density function for the exponential probability distribution with parameter $\lambda$ is

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$$

For example, if we have a server at which client requests arrive with an exponential interarrival time distribution, then $\lambda$ is the average rate of arrival of those client requests.

For a discrete random variable $X$ that can take discrete values, say in the set $\{x_1, x_2, \ldots, x_n\}$, the probability distribution associates a probability $p_i$ with the event that $X = x_i$ ($p_i = P(X = x_i)$). Here, we have the property $\sum_{i=1}^{n} p_i = 1$, with $0 \le p_i \le 1$ for all $i$.

Conditional probability is defined in the context of several events occurring dependent on each other. If we have two events $E_1$ and $E_2$, and they are independent, then the probability of both events occurring is $P(E_1 \cap E_2) = P(E_1)\,P(E_2)$. In general, whether they are dependent or not:

$$P(E_1 \cap E_2) = P(E_1 \mid E_2)\,P(E_2) = P(E_2 \mid E_1)\,P(E_1).$$

In case $E_1$ and $E_2$ are independent, $P(E_2 \mid E_1) = P(E_2)$ and $P(E_1 \mid E_2) = P(E_1)$.

The expectation (or average) of a random variable is, intuitively, the value obtained by weighting every possible value of the variable with its probability (density). The average of a continuous random variable is computed as follows:

$$E[X] = \int_{-\infty}^{+\infty} x\, f(x)\,dx.$$

For example, the average value of a random variable with exponential distribution with parameter $\lambda$ is

$$E[X] = \int_{0}^{\infty} x\, \lambda e^{-\lambda x}\,dx = \Big[-x e^{-\lambda x}\Big]_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$

The average value of a discrete random variable is computed as follows:

$$E[X] = \sum_{i=1}^{n} x_i\, P(X = x_i).$$

In Chapter 7 we will often use the following property of average values: the average of a sum of random variables (not necessarily independent) is equal to the sum of the averages, e.g. $E[X + Y] = E[X] + E[Y]$ [3].

Conditional expectation can be defined in several flavours (conditional expectation given an event, conditional expectation given another random variable, etc.).

In this work we are interested in conditional expectation given an event. Given a random variable $X$ and an event $E$ we have that:

$$E[X \mid E] = \frac{E[X \cdot 1_E]}{P(E)}, \qquad \text{where} \quad 1_E = \begin{cases} 1 & \text{if the event } E \text{ happened} \\ 0 & \text{if the event } E \text{ did not happen.} \end{cases}$$

Therefore, $E[1_E] = P(E)$.

More concretely, if $E$ is the event $Y \le y$, where $X$ and $Y$ are random variables, then

$$X \cdot 1_E = \begin{cases} X & \text{if } Y \le y \\ 0 & \text{if } Y > y. \end{cases}$$

Therefore $E[X \cdot 1_{\{Y \le y\}}] = E[X \mid Y \le y]\; P(Y \le y)$.

2.5.2 Queues and related notions

Here we present a very brief overview of the basics of queuing theory.

Customers arrive at a server to receive service. If the server is busy servicing other customers when a new one arrives, then this one has to wait in a queue that builds up in front of the server. The time between two consecutive customer arrivals (customer interarrival time) is a random variable with a certain probability distribution. The most popular distribution used is the exponential distribution. Also, most of the queuing models use the assumption that these interarrival times are independent of each other. The service time of the server is also a random variable, and the most popular distribution used is the exponential distribution.

The nice thing about the exponential distribution is that it has the Markov property. This is the reason why it is used in queuing theory. Let $X$ be a random variable. We give the definition of the Markov property by citing from the book by Allen [3]: “... Markov property, sometimes called the “memoryless” property, given by $P(X \le t + s \mid X > s) = P(X \le t)$, $s, t \ge 0$. ... if $X$ is the waiting time until a particular event occurs and $s$ units of time produced no event, then the distribution of further waiting time is the same as it would be if no waiting time had passed – that is, the system does not “remember” that $s$ time units have produced no “arrival”.”

The average time a customer waits in the queue of a server with exponential service time distribution, when customer interarrival times are independent and exponentially distributed, is given by a classical formula involving only the average rate of customer arrivals and the average service rate (a standard form of this formula is recalled below). A queue of a server with exponential service time and exponential customer interarrival time is called M/M/1 (Markov/Markov/1 server). According to Burke's theorem [21], the interdeparture times of customers from a server in an M/M/1 queuing system are also independent and exponentially distributed random variables, and have the same average as the customer interarrival times.
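The “classical formula” referred to above is not spelled out in the text. For reference only – assuming the usual M/M/1 notation with arrival rate $\lambda$, service rate $\mu$ and $\lambda < \mu$ – a standard form of the mean waiting time in the queue is:

$$W_q = \frac{\lambda}{\mu(\mu - \lambda)}, \qquad \text{with mean time in the system } W = W_q + \frac{1}{\mu} = \frac{1}{\mu - \lambda}.$$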

Chapter 3

Fault tolerance and middleware

This chapter presents an overview of theoretical and practical research findings in the area of availability, fault tolerance and middleware. The first four sections of the chapter cover more details of the notions introduced in Chapter 2. The reason for providing a deeper coverage of research results in this chapter is to give the reader a larger perspective within which our work should be understood. The last three sections briefly survey fault tolerance support in commercial middleware, and assessment of performance and availability. These three subjects also set the broad scene in which our contribution takes place. The first section deepens the knowledge about consensus. Chapter 2 defined the consensus problem. Here, we will review algorithms for solving it. The following two sections are dedicated to broadening the view on failure detectors, including perfect detectors. The fourth section provides a perspective on how transparency of replication is achieved via groups and sets the stage, in the fifth section, for a review of commercial middlewares and their fault tolerance features.

3.1 Basic algorithms for consensus

Studying consensus in asynchronous systems is interesting because of the early “FLP impossibility result” [52]. However, in that context, the failure model was the simplest one: crash. A survey of consensus algorithms in synchronous systems by Raynal [132] shows interesting problems arising in these more restrictive timing models. Here, failure models can be crash, as well as omission and Byzantine. Also, the notion of uniformness is possible to exploit here. Uniform agreement, for example, means that no two (correct or not) processes decide different values.

3.1.1 Consensus in synchronous systems

Interesting findings for synchronous systems include the following [132]:

When processes fail by crashing, i.e. after failing do nothing, or by omitting to receive or send messages, a relatively simple algorithm helps processes to reach agreement after $f+1$ rounds of message exchanges ($f$ is the number of failing processes).

When processes fail by omission failures, since omissions allow the “failed” processes to eventually reach a decision step, the agreement property of consensus might be violated. This is because, for example, processes that constantly omitted to receive information from the rest of the deciding parties will finally decide on their own proposed value. In other words, uniform consensus cannot be solved by the simple $f+1$ round algorithm in presence of $f$ omission failures. A different algorithm in $f+1$ rounds can solve consensus for this failure pattern, but only if a majority of processes stay correct.

When processes are exposed to Byzantine failure patterns, an algorithm with $f+1$ message exchange rounds solves consensus, but only if at most one fourth of the processes fail ($f < n/4$).

Note the hierarchy of failure models: crash is less severe than omission that is less severe than Byzantine; the algorithm that helps processes reach agreement in presence of Byzantine failures necessitates more messages exchanged by processes, and more correct processes present in the agreement procedure.

3.1.2 Consensus in asynchronous systems

In asynchronous systems combined with the simple crash failure model, different assumptions are made about node behaviour after a crash: the no-recovery and the recovery assumption.

Algorithms for solving consensus in the crash-no-recovery model were presented by Chandra and Toueg [25] in their seminal work about unreliable failure detectors. No-recovery means that a process, after it (really) crashes, is never able to return and participate in the distributed algorithm. Thus, as soon as the failure detectors suspect a crashed process, it will stay in their suspect list forever. On the other hand, if a process is only too slow and suspected to have crashed, the failure detector might “change its mind”. Thus the crash-recovery model is also present, in a way.

One of the algorithms presented solves consensus by using a strong (strongly complete and weakly accurate) failure detector, even if all but one process fail. Since the weak accuracy property is satisfied, there is a correct process that is never suspected to have failed. The main idea is to execute the algorithm in such a way that the decision of this one process eventually reaches all the other correct processes. The consensus algorithm is distributed in the sense that each process executes the same steps. The steps also contain synchronization points when, in a certain iteration – also called a round – a process waits for messages sent in the same round by the other non-suspected processes.

When using eventually strong (strongly complete and eventually weakly accurate) failure detectors, consensus can be solved only in the presence of at least a majority of correct processes. The algorithm is based on a rotating coordinator approach (this coordinator is the decider). By eventual weak accuracy it follows that eventually there is at least one correct process that is not suspected by any correct process. Thus, eventually, the current coordinator will manage to decide the final value. When suspecting the coordinator to have crashed, processes move to a new round with a new coordinator. However, this situation will not last forever.

When solving consensus in the crash-recovery model, a new type of failure detector is needed. In this model, the process loses all data from its memory when it crashes, but it is repaired and it can return and participate in the algorithm. Aguilera et al. [2] define the new failure detector to have somewhat different completeness and accuracy properties than the Chandra-Toueg initial detectors. The new type of failure detector that is needed is justified as follows. There can be processes that crash and recover infinitely often. Thus, it would be wrong to allow a failure detector to permanently suspect them, as allowed by the strong completeness property in the sense of Chandra and Toueg [25].

This section described an important primitive used in fault-tolerant distributed systems. It is typically used in the context of presenting a unique result of an operation that was performed by a set of participating entities. The next section is a follow-up of Section 2.1.4, strongly connected with consensus: it reviews research results concerning different types of unreliable failure detectors.

3.2 Variants of unreliable failure detectors

Since the formal introduction of unreliable failure detectors by Chandra and Toueg [25] in the context of circumventing the “FLP impossibility result” in asynchronous distributed systems, a lot of research has taken place in this area. A good description of failure detectors is given in the book by Tel [149].

Most research results relate different classes of unreliable failure detectors to the solvability of consensus. The main class of failure detectors studied, stamped as the weakest failure detector to solve consensus, is the “eventually strong” detector $\Diamond S$ (defined in Section 2.1.4). This failure detector is stronger than $\Diamond W$ in that it has the property of strong instead of weak completeness. Ensuring the strong completeness property demands little effort.

Considering the $\Diamond S$ failure detector, the way implementations of the failure detector influence the termination time of consensus is studied by Sergent et al. [142]. The findings are related to trade-offs between how quickly a failure is detected, implying frequent checking of the availability of a processing unit, and the number of messages sent on the network. Also, the way one checks whether a processing unit is up makes a difference in when the consensus algorithm terminates, both when no failures occur and when failures occur. The main conclusion of the work is that the design of a consensus algorithm is orthogonal to implementing the failure detector.

Larrea et al. [91] study failure detector implementations. One important result is an optimal implementation of the eventually strong failure detector. The approach is to reduce the number of messages exchanged when detecting failures. Another work of Larrea et al. ([90]) shows that it is impossible to implement failure detectors with perpetual (as opposed to eventual) accuracy in partially synchronous systems. Their result also shows that since consensus can be solved with “eventual” failure detectors (even with the weakest one), perpetual failure detectors are not necessary for solving it.

In theory, the fact that consensus in an asynchronous system, where process failures are detected possibly erroneously, eventually terminates is a good enough quality. However, in practice, as stated by Chen et al. [26], some applications have certain timing requirements and thus eventual termination guarantees are not sufficient. The notion of quality of service of failure detectors is introduced as a measure of how fast a failure detector can detect actual failures and how much it can avoid false suspicions. Depending on the quality of service required by a certain application, the failure detector can be adapted to the new situation by changing different timeout parameters.

Further extensions of the adaptability of failure detectors are provided in the work of Sampaio et al. [137]. Analysis of the timing behaviour of the rotating coordinator based consensus algorithm is done when processing as well as bandwidth resources available for a process are not known a priori. The authors employ Petri nets to model the consensus protocol with different resource availability scenarios and different failure detection scenarios.

The next section will explain how perfection can be achieved in failure detectors.
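A common way of realizing such timeout adaptation in practice is a heartbeat-based detector that increases its suspicion timeout after a false suspicion. The Java sketch below illustrates that generic idea only; it is not the detector of Chen et al. [26] or of Sampaio et al. [137], and all names are assumptions.

```java
/** Heartbeat-based failure detection with an adaptive timeout. */
class AdaptiveFailureDetector {
    private long timeoutMillis = 1000;     // initial suspicion timeout
    private long lastHeartbeat = System.currentTimeMillis();
    private boolean suspected = false;

    /** Called whenever a heartbeat from the monitored process arrives. */
    synchronized void heartbeatReceived() {
        long now = System.currentTimeMillis();
        long observedDelay = now - lastHeartbeat;
        lastHeartbeat = now;
        if (suspected) {
            // False suspicion: the process was only slow, so increase the timeout
            suspected = false;
            timeoutMillis = Math.max(timeoutMillis, observedDelay + 100);
        }
    }

    /** Called periodically; returns true if the process is currently suspected. */
    synchronized boolean isSuspected() {
        if (System.currentTimeMillis() - lastHeartbeat > timeoutMillis) {
            suspected = true;
        }
        return suspected;
    }
}
```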

3.3 Perfect failure detectors

Sometimes, when solving consensus problems, besides the minimal safety requirements like agreement and validity (i.e. decision on a unique valid value), an important demand is for decision making to be based on using proposed values from all correct processes. For example, this is needed to solve the strict version of the atomic commit problem, when it is not permitted to ignore any answer (YES or NO) from a correct process. This requirement cannot be satisfied if the failure detector makes “wrong suspicion” mistakes, i.e. it suspects a correct process to have crashed, thus ignoring its value in the decision phase.

A strict(er) form of consensus is the problem of decision based on a global set of values in presence of perfect failure detectors. Helary et al. [78] call this notion global function computation. The questionable part of this argument is the method for implementing perfect failure detectors. This relies on a somewhat “perfect real-world” assumption: the possibility of having privileged channels – channels that are never congested and never go down, and therefore have predictable timing behaviour. Thus, processes could use these channels for sending “I am alive” messages. The failure model includes crash failures and the assumptions that a process does not recover from a real crash, and that there is no such thing as revoking one's suspicion about a process crash. The reason for the latter is that the failure detectors used are perfect and they do not suspect a crash if this is not real. Also, as there is no mistaken crash suspicion, all correct processes will see the same processes as being up or down. As a consequence, there is no possibility for disjoint sets of separately working processes to exist.

thesis is that the weakest failure detector to solve consensus when the number of

failed processes is not restricted is no longer ¡ (or ), but (perfect)! And 32 Chapter 3. Fault tolerance and middleware this, by only considering failure detectors that are realistic, i.e. cannot guess the future. In this context, any realistic failure detector used when any number of processes can crash, can be transformed to a perfect failure detector. From here they concluded that the perfect failure detector is the weakest to solve the problems. Thus, the failure detector hierarchy of Chandra and Toueg tends towards one single failure detector to solve consensus, when the number of failures is not restricted. The Timely Computing Base (TCB) model introduced by Casimiro and Veris- simo ([156, 23]) employs a perfect timing failure detector. A timing failure occurs if a timed action terminates some time after its specified termination time. The TCB is used by applications built over an asynchronous network, that need control of the timely execution of processes. When the TCB is involved, it offers the possibility of measuring time durations, detecting timing failures and being able to execute in a timely manner functions of the application. The “perfection” is achieved by mak- ing the TCB module synchronous. This way, the properties of timed strong com- pleteness and timed strong accuracy can be enforced. Timed strong completeness means that any timing failure is detected in a bounded time interval after the spec- ified termination moment of the respective action. Timed strong accuracy means that a timing failure is not deemed to have occurred if the timed action finishes in a bounded time interval before the specified termination moment of the respective action. The next section will present the notion “process group” related to how a set of replicas act towards an external client or viewer, as one unique entity.

3.4 The process group abstraction

After introducing the notions of “group” and “group services”, this section will describe some platforms incorporating fault tolerance handling mechanisms that were the precursors of today's commercial middleware platforms.

Intuitively, a process group is a set of replicas able to do the same processing task, or to contribute towards fulfilling a common goal. The meaning of process group in the fault-tolerant computing community is the following: a group of processes that cooperate with each other is a mechanism for building fault-tolerant service providers. The members of the group are not fixed a priori, and the group mechanisms take care of dynamic changes to the group. A service provider, in this context, is a collection of replicated servers, a service group, that processes requests from clients. The group is transparent from the client's side. Thus, the client, when sending a request to the replicated server (or now, a process group), perceives it as one addressable unit. A group might be used to manage complexity, to balance load, or to provide fault-tolerant services [126]. The group has to manage itself to assure the one-unit image.

An important notion is the group member. A process may or may not be a member of a group at different times. If it is not, it might have been excluded from the group due to a crash (real or suspected), or due to isolation from the group because of communication failures. In the latter case, it is possible that the process is still working, but its view of the group membership differs from other processes' views. For example, in the AQuA framework [133], when passive replication is used, two types of members can exist in a group: persistent and transient. Persistent members do not leave the group once they have joined it. Transient members only join a group when they need to send a message to its members.

3.4.1 Group services

An indispensable service in a group setting is the one that provides information about the group membership. The group members themselves use this to be able to assure transparency even in presence of failures, which involves masking. A membership service has to ensure that group members have a consistent view of the group.

Another specific group service is the state merging (or transfer) service. This is used when the initial group is split into several partitions, or into one majority and some minority partitions. When, e.g., links are repaired and partitions disappear, then all new group members have to know the state of the rest of the members. The state transfer service is also used when a new node joins the group. The operation of state transfer has to occur as an atomic action. Simple interface sketches for these two services are given below.
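The contracts of these two services can be summarized by small interfaces. The Java sketch below is a hypothetical rendering for illustration only; it is not the API of any of the platforms discussed in the following sections.

```java
import java.util.List;

/** A consistent snapshot of the group membership (a "view"). */
record GroupView(long viewId, List<String> members) { }

/** Membership service: keeps members agreeing on who is currently in the group. */
interface MembershipService {
    GroupView currentView();
    /** Invoked at every member when a new view (join, leave, crash) is installed. */
    void onViewChange(GroupView newView);
}

/** State transfer service: brings a joining (or rejoining) member up to date. */
interface StateTransferService {
    byte[] captureState();            // executed at an up-to-date member
    void installState(byte[] state);  // executed atomically at the new member
}
```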

3.4.2 Specification and implementation of group services

Formal descriptions of membership service properties, both for synchronous systems and asynchronous systems, are found in the works of Cristian ([35, 38]). For explaining problems in presence of the asynchrony assumption, Cristian introduces the notions of stability and instability of a system. When a system is stable, communication between processes either works all the time with constant (bounded) delays, or it does not work. In contrast, during instability periods there are no guarantees of message delay bounds; the communication can be working, but very slowly, or not at all. The important thing is that the membership service has to ensure the agreement of group members on the group view and to correctly include and exclude members from the view.

Following the line of Cristian's work is the formal specification and implementation of a membership service for partitionable systems, by Babaoglu et al. [10]. A formal description of a communication infrastructure for processes in a partitionable distributed system is given. Thus, here, including link failures gives rise to a failure model that allows partition failures to occur.

Work by Birman et al. on the ISIS system has coined the notions of virtual synchrony and view synchronous multicast. Virtual synchrony is related to the provision of well-chosen “synchronization” points for processes in an asynchronous system. Thus, a process will stop and synchronize with other processes when it has to wait for messages causally preceding other messages ready to be delivered. View synchronous multicast is tightly coupled with virtual synchrony, and directly relates to groups. A view synchronous multicast works as follows: a message sent by a process while the process membership view is V has to be delivered, by all surviving members of V, before any other new view W is registered (“registered” is referred to as installed by Birman et al. [15]). In this context, one says that a group uses a view synchronous multicast service.

Although this is one of the leading groups in the area of group communication, much of the work done on ISIS in the early years can be characterized by a “system building” approach. A lot of the detailed algorithms were initially not formally presented [18]. Further, the procedure execution was more transaction based, and the behaviour of the system was analyzed pragmatically. Later, however, with the introduction of the notions of virtual synchrony and view synchronous multicast, algorithms and communication primitives are described more formally [15, 17].

3.4.3 Fault-tolerant group based platforms

Precursors of today's fault tolerance handling middleware platforms were platforms providing underlying group communication primitives.

One of the successful experimental platforms around the ideas of software replication and group services was implemented in the Delta-4 project [125]. This work leaves the decision concerning the particular way of replication as an open question for the application and provides several modes to choose from in the open architecture. The project was discontinued in 1992, but several mechanisms were studied during the project.

Totem, developed by Moser et al. [107], is a platform providing group communication primitives. It is used in the message delivery layer in the CORBA based Eternal platform. Totem is structured in several layers, deals with sending messages among different process groups, enforces total order on delivery of messages, and also deals with merging of groups. The notion of extended virtual synchrony is defined with Totem. In this model of virtual synchrony, care is taken even of the processes not in the current group of a certain process. Messages that are received and delivered by certain processes are delivered in the same order by all of these, regardless of whether they (currently) have the same view on their group or not. The application writer can request a certain type of message delivery order. The platform is suitable for use in soft real-time systems, since high throughput can be achieved, as well as predictable latency.

The Horus toolkit was developed by Van Renesse et al. [134, 155]. It allows flexible composition of protocols to support group-based services. The application developer can use protocol components with well-defined interfaces to build the needed protocol stacks, while excluding unnecessary overheads. With Horus, the difficulty lay in the transformations in the protocol stacks written in C. The next generation of Horus was the Ensemble toolkit [16, 154], which is in essence similar to Horus, but written in a higher level language allowing more flexibility. Ensemble is one of the few group communication toolkits that was benchmarked [13]. The behaviour of different primitives – membership service, FIFO-ordered reliable multicast, atomic reliable broadcast – was investigated when injecting real errors.

Challenges when integrating group communication with transaction based processing are presented by Little and Shrivastava [96]. Implementation of the approach is considered on top of the CORBA standard since it readily includes a transaction service specification. It was much later that fault tolerance by process replication started to become standard. There is a trade-off between the use of systems that only support transactions, and the use of systems with only process replicas (function replication). Systems with transactional fault tolerance recover better from total crashes, because the transaction concept includes undo capabilities. On the other hand, partial crashes, which can happen more often than total ones, are handled much faster (and more transparently) in the case of replicated groups.

The following section will present a short overview of existing commercial middlewares and how they handle fault tolerance.

3.5 Fault tolerance and commercial middleware

When an application developer uses a middleware to run an application on top of it, he/she expects that many common non-application related features are incorporated in the middleware. Features related to fault tolerance handling are: client transparency to failures, and automatic replica management with state saving and restoring features. Today's extensively used middlewares exist in several flavors: those incorporating component/object as well as communication management (e.g., CORBA, COM/DCOM), those dealing with communication management (e.g., Java RMI), those that support resource allocation in a dynamic way (e.g., Jini), and those that offer containers for objects that can communicate using several offered technologies (e.g., EJB). In the following we will briefly survey these platforms and how they support application writers in easily writing and deploying fault-tolerant applications.

Microsoft's COM/DCOM allows customizing different elements of the remoting architecture (the infrastructure connecting clients to server objects) in order to accommodate communication with multiple server copies and transparent failover. The COMERA extension by Wang and Lee [159] was needed in order to provide efficient ways especially for replica migration, where connections have to be restored in a transparent way and logged messages have to be replayed.

Java RMI, as a basic technology, does not include fault-tolerance features. On the other hand, built on top of this communication infrastructure, middleware such as the Jgroup toolkit [106] or the AROMA [115] system provide transparent fault tolerance to applications. In Jgroup, clients transparently access replica groups as one entity. Replica group members, when processing a query, exchange messages ensuring proper state logging. The group transparently handles failure of a replica. AROMA intercepts RMI messages and sends them in the same order to all server replicas, thereby assuring consistent states at failover. The replication style can be chosen by the application writer on the basis of the fault tolerance needs and the failover time allowed.

Jini takes a different approach to fault tolerance. As a technology inherently used for distributed, potentially mobile, networked applications, it does not advocate explicit replication for fault tolerance. Therefore, no automatic application server or client replication is supported; neither is application state saving and restoring. On the other hand, by offering the Lookup Service (possibly replicated) and the lease-based resource allocation approach, failures of different service providers can be made transparent to the client [92, 114].

EJB technology uses clustering and transactions to provide transparent fault tolerance to client applications [76]. Still, this is not sufficient for the servers to be available in case of failures. Enhanced with group communication [118], the architecture can provide transparent failover to clients as well as proper state logging and failover. The work by Kistijantoro [83] again reveals the shortcomings of EJB. Breaking down the replication problem into state replication and application server replication, the result of the study is that state replication can be done transparently to the client. Application servers (containers) cannot be replicated transparently unless the container objects implement special interfaces.

Considering the above, we can conclude that CORBA is by far one of the most studied middlewares that offer interesting challenges in the direction of fault-tolerant computing. In 2000, the first draft of the FT-CORBA specification was published and in December 2001 it became part of the CORBA standard.

The next two sections will present problems and results in the areas of performance and availability assessment in distributed systems.

3.6 Assessing performance

Algorithms for solving problems in distributed systems are expected to satisfy some specifications. These contain some properties – usually considering eventual termination as opposed to efficiency – that have to be fulfilled each time the algorithm is executed. Examples are agreement and validity in connection with consensus. Using the specification, correctness of the algorithm can be proved. Thus, an interesting question is: what can one look at when reasoning about the performance of a distributed algorithm?

Some criteria could be related to time complexity – meaning the possibility of bounding the execution time. Further, two cases can be considered: the best case (when no failures occur), or the worst case when failures occur. For these criteria, analysis can include, e.g., the consideration of the number of rounds for an algorithm execution. This is interesting to do when the round number is not fixed (as in the work of Chandra and Toueg [25]). Other criteria could be related to resource utilization (like memory or stable storage). The communication medium can also be regarded as a resource; thus the number of messages exchanged during communication is also a measure of efficiency.

Another way to look at measuring efficiency is to see how an algorithm behaves if the worst-case assumption under which it was built does not hold (e.g. the assumption is that $n$ failures can occur but, in fact, only $m < n$ failures occur in 90% of the cases).

Few works discuss real-time aspects (i.e. time bounds) in the context of distributed algorithms. Synchronous systems (i.e. those in which a global clock exists to which nodes have access), together with associated protocols and fault tolerance mechanisms, have been extensively studied by Kopetz et al. and described under the general name of time-triggered systems [85]. The time-triggered protocol (TTP) is used in hard real-time distributed systems where timing requirements are very strict, and strong guarantees have to be provided about satisfying deadlines. Verissimo et al. [156, 23] employ a partly synchronous approach by using the timely computing base (introduced in Section 3.3). This also offers the possibility to reason about execution timeliness. Cristian and Schmuck discuss real-time aspects in the context of distributed algorithms, in a client-server configuration [35, 38]. In a synchronous setting it is not so difficult to reason about bounds on, say, agreement time, and thus on response time to the client request. In an asynchronous system, the problem is somewhat more challenging. As already mentioned before, the system model is not totally time-free. There can be some bounds on message delays, but these hold only during stability periods of the system. The interesting part is that the analysis is done by taking into consideration the instability periods of the system as well. The algorithms described for asynchronous settings do have quite long worst-case stabilization times. More recently, Lima and Burns [94] have introduced consensus mechanisms on top of a CAN bus model which can rely on the earlier real-time guarantees provided by that model.

Helary et al. give upper (worst case) bounds for the number of rounds executed in algorithms for global function computation [78]. Aguilera et al. give time estimations and message number complexity for best case executions and compare them with the same-condition performances of other algorithms [2]. Other works analyze different aspects of performance of software systems in presence of failures. The goals of these works range from prediction of reliability [66], and response-time analysis in presence of failures [152, 169], to fault-detection time analysis [131], and models for checkpointing systems [22, 47, 82, 108, 164, 88, 30, 29, 121, 95, 63, 148, 146].

3.7 Assessing availability

A failure mode analysis on a CORBA service implementation is conducted by Marsden et al. [101, 102]. Several implementations of the CORBA Naming Service are chosen for tests. The results of the analysis are supposed to be used mainly to provide information about which CORBA implementation to choose for a certain application. This is similar, to some extent, to doing analysis on the usage of different replication styles to provide fault tolerance, and providing information for application writers about which is the best choice for a certain application.

[Figure 3.1: State transition diagram example – states A (Reboot), B (Functional) and C (Connectivity problem); the arc from A to B is weighted 0.41 and the arc from A to C is weighted 0.0722.]

The technique used is the injection of faults like bit-flips and double-zeros in the IIOP (Internet Inter-ORB Protocol) messages sent to the server. Depending on the implementation of the service, the generated exceptions give satisfactory information or not (for example, if the exception is of type UNKNOWN).

Measurements on a system under failure conditions are done by Iyer et al. [77]. More precisely, data are collected about failure causes and downtime. The analysis by Iyer et al. concerns local area network computers, i.e. computers running in a more or less controllable environment, and Internet computers, i.e. computers that are connected to the Internet and thus exhibit unpredictable behaviour. Downtimes are measured by looking at the timestamp of the last event in the log before the reboot, and the timestamp of the event immediately after the reboot. Iyer et al. show concrete results concerning the behaviour of the system by giving a model of the system in the form of a state machine. The arc between, say, states A and B is weighted with the fraction of the total number of transitions from A into B. For example: state A is the Reboot state, state B is the Functional state, and state C is the Connectivity Problem state [77]. In Figure 3.1 the value 0.41 means that only in 41% of the cases after a reboot the system works properly. In 7.2% of the cases after a reboot a connectivity problem appears and has to be solved in order for the system to work properly (see Figure 3.1).

Safety-critical systems constitute a category for which dependability is very important. However, for these systems it is difficult to make safety assessments credible. A way of obtaining incontestable proofs is to perform static analysis on source code. Abstract interpretation and Hoare logic methods are investigated by Nguyen and Ourghanlian [113] in the context of a realistic neutron flux measurement system. The techniques studied were further used by the French Electricity Department for dependability assessment of a safety-critical hardware-dependent system (an operating system).

A way of improving availability by measurements is also to understand a running distributed system. Moe and Carr present a solution to the problem of understanding and modifying a distributed system [105]. The approach chosen is to trace the execution of an application built over a CORBA platform. The tracing is done at the remote procedure call level, by using CORBA portable interceptors to get hold of the relevant information. The next steps after tracing are parsing of the traces and subsequent analysis. By using these techniques, several flaws in the system design can be identified. Experiments were conducted on a large Ericsson Operation and Management application, consisting of around 600,000 lines of code. When employing tracing, no extra effort from designers and developers is required. Also, care is taken that system performance is not materially affected.

The results presented in this section are useful when devising failure models, for example, of infrastructures on which applications are built.

Chapter 4

Two fault-tolerant CORBA implementations

We mentioned earlier that fault tolerance mechanisms can be implemented either in every application separately, or in a middleware, thereby providing reduction of application code size. Besides, the middleware can be used as underlying platform for running several types of applications that need fault tolerance. In this chapter we advocate the encapsulation of fault tolerance mechanisms in middleware. More precisely, we describe extensions of an existing CORBA implementation (OpenORB [39]) with fault tolerance features. The chapter has two main parts:

Part one presents the extension of OpenORB with fault tolerance which closely follows the FT-CORBA standard. Here, extra infrastructure building blocks are created and assumed to be non-failure-prone, as opposed to application units that may fail, and whose fault tolerance capabilities are enforced by this infrastructure. We will call this platform the FT-CORBA infrastructure.

Part two deals with another extension of OpenORB, towards a fully available platform (FA-CORBA infrastructure). Here, infrastructure units may also fail together with application units. The fault tolerance capabilities of the application are enforced by the infrastructure that executes a distributed algorithm helping in reaching consensus.

Both parts will present challenges and lessons learnt when extending an existing ORB with fault tolerance. The implementation of the platforms was a prerequisite for benchmarking experiments that study the performance-availability trade-offs. These experiments will be presented in the next chapter.

4.1 The FT-CORBA infrastructure

This section will present the architecture of the infrastructure obtained by following closely the FT-CORBA standard. First, a short overview of the FT-CORBA standard will be presented.

4.1.1 The standard

OMG's generic CORBA Specification was extended to provide application developers with support for fault tolerance. The Fault-Tolerant CORBA Specification V1.0 was adopted in April 2000 [116]. In December 2001, the FT-CORBA specification became part of the CORBA standard. In this section we give a short overview of the standard.

To obtain fault-tolerant applications, the underlying architecture is based on replicated objects (replication in space). Temporal replication is supported by request retry, or transparent redirection of a request to another server. Replicated application objects are monitored in order to detect failures. The failover procedure is dependent on the replication strategy used. Support is provided for the use of active replication, primary/backup and stateless replication (the latter is used when the server object data is accessed only via read method calls).

The primary/backup replication styles provided are warm passive and cold passive. In case of cold passive replication, backups are not active and are not updated with the primary's state and information about processed method calls. The state of the primary is checkpointed periodically, and information – such as arguments, unique identifier, etc. – about update (i.e. non-read) method calls is stored in a log. At failover, all this information is transferred to the backup that has to take over the role of the primary. For warm passive replication, on the other hand, state checkpointing at the primary coincides with the transfer of that information to all the backups. Also, information about method calls incoming to the primary is broadcast to the backups and stored there in a log, without being executed. At failover, all necessary information is present at the backup that will be promoted to the primary position. In case of active replication, the standard strongly recommends the use of a gateway for accessing the active replicas. This gateway plays the role of a relay for method calls – it broadcasts the method calls to all replicas, which have to execute them in the same total order.

Some necessary building blocks are specified as interfaces. The ReplicationManager interface incorporates methods from three interfaces:

The PropertyManager interface has methods used to establish the replication style, the number of replicas, the consistency style and group membership style.

The ObjectGroupManager interface has methods that can be invoked to support the application-controlled membership style, at the price of losing transparency.

The GenericFactory interface has the method create_object that, in case of replicated objects, is called transparently by the replication manager in response to a call to its own create_object method by the application. Each object in a group has its own reference, but the published one is the Interoperable Object Group Reference (IOGR).

The specified FaultNotifier interface contains methods for creating event filters, for registering fault consumers, as well as for announcing faults. The fault monitor is not specified as an interface, but its functionality is quite well described.

To be able to manage large applications, the notion of fault tolerance domains (FT domains) is introduced. Each fault tolerance domain contains several hosts and object groups, and a separate replication manager and fault notifier are associated with it. On every host there must be one object factory and one fault monitor. The object factory creates group replicas on demand. The fault monitor has to detect failures of group members running on its host.

Every application object has to implement at least these two interfaces: PullMonitorable, containing the method is_alive, and Checkpointable, containing the methods get_state and set_state. They are used for the purpose of fault detection and checkpointing. It is also possible that the object implements the Updateable interface, and thus the methods get_update and set_update, to deal with state updates. This interface is appropriate to use when the state is quite large, while state changes are of reasonable sizes. The saved state (or state changes) information is later used in case the object fails and a new replica has to take over its role (and state). A sketch of an application object implementing these operations is given at the end of this section. Logging and failover mechanisms are automatically used in case of infrastructure-controlled consistency and membership. They are needed only for passive stateful replication. In case of active and stateless replication styles, request replies can be logged in order to avoid repeated execution of a method call, and to directly send the logged reply.

There are some limitations pointed out in the FT-CORBA specification. For example, the ORBs on hosts in the same fault tolerance domain have to be from the same vendor. If this is not the case, then full interoperability is not always possible. Another limitation is related to the types of faults detected and treated. Namely, no correlated (design) faults are treated. Also, no mechanism is provided to treat partition failures. There is a quite simplistic way to protect against Byzantine failures: the Active-With-Voting replication style can be used.
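In a Java realization of these IDL interfaces, an application object roughly takes the following shape. This is a hedged sketch only: the servant base classes generated by a concrete ORB are omitted, and the example class, its state and the byte-array state encoding are assumptions, not part of the standard or the platforms described here.

```java
/** Sketch of an application servant exposing the FT-CORBA monitoring and
 *  checkpointing operations (is_alive, get_state, set_state). */
class BankAccountServant {                   // hypothetical application object
    private long balance;                    // the application state

    // PullMonitorable: periodically called by the fault monitor
    public boolean is_alive() {
        return true;                         // reachable and able to answer
    }

    // Checkpointable: read the full state for checkpointing
    public byte[] get_state() {
        return java.nio.ByteBuffer.allocate(Long.BYTES).putLong(balance).array();
    }

    // Checkpointable: install a previously checkpointed state at failover
    public void set_state(byte[] state) {
        balance = java.nio.ByteBuffer.wrap(state).getLong();
    }

    // An update method: its invocations are logged between checkpoints
    public void deposit(long amount) {
        balance += amount;
    }
}
```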

4.1.2 Failure model and assumptions

In the FT-CORBA standard the emphasis is on nodes and not on links. The considered failure model does not include link failures. Links are assumed to be reliable. This means that no message is corrupted by the link. Also, no duplication of messages happens. Therefore, if a message (request) is sent from a correct client process, then the destination server process should deliver the message eventually, as long as it stays correct. The same holds for sending the reply back to the client. If the server process is correct all the time while the request is processed, and the answer is sent to the client process waiting for it, then this message will be received (delivered) by the client, eventually – as long as the client stays correct for the whole roundtrip time of the request, so that the connection stays up.

The application on top of the infrastructure is an asynchronous distributed system. There is no bound on message delays and thus no way to differentiate between a slow server and a crashed one. Therefore, all failures reported by failure detectors are just suspicions.

The node/process failure model considered in an application built on top of the infrastructure is the crash model. To be more precise, it can happen that an application server object is no longer able to process requests from clients, because it crashed. Infrastructure server object crashes are not included, the infrastructure being considered robust. Therefore, the failure of an entire host (e.g., because of a power failure) running infrastructure objects as well as application servers is not part of the failure model. Failover information, such as state and information about requests serviced since the last checkpoint, is collected in an infrastructure unit outside the application server object. This unit is supposed to continue being up after the crash of the application server. As a result, a new member of the replica group (residing on a different host) can easily be invested with the role of the failed member.

As specified in the FT-CORBA standard, there is a property called “minimum number of replicas” associated with a replica group. The value of this number is supposed to be at least two. As soon as the number of previously created correct replicas drops under this value, one (or more) new replica has to be created. As already mentioned, with the FT-CORBA standard, more complicated failures such as the Byzantine type are not dealt with. Also, the failure of the client object sending requests to the server is not dealt with explicitly.

By the state of an object (application) we mean internal (and possibly external) application object data – fields of a class – that can be modified by so-called update method calls. Further, update method calls are assumed to be made only via CORBA calls, i.e. those which can be “seen” by hooks installed outside the application, at the ORB level. If, for example, the application is such that the state can be changed by internal calls (for example, made from different threads running in the object) not following a CORBA call path, no guarantee can be made by the FT-CORBA infrastructure about the state consistency of the replicas. And this is the case even in primary/backup replication.

The type (update or read-only) of a server object method is exposed to the infrastructure by the application writer. The Object Factory, written also by the application writer, creates objects that are instances of different classes.
When doing this, it has knowledge about the types of the methods of that class. At runtime, when receiving a client request, the hook underlying the application looks up and uses this information to decide whether to store the call information in the log, to be used for replay at failover if necessary. Read-only method calls are not stored in the log.
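The decision taken by the hook can be pictured as in the sketch below. This is an illustration of the mechanism as just described, not the platform's actual classes; the registry populated by the Object Factory and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Per-class knowledge of which methods are updates, exposed by the application writer. */
class MethodTypeRegistry {
    private final Set<String> updateMethods = new HashSet<>();

    void registerUpdateMethod(String operationName) {   // filled in by the Object Factory
        updateMethods.add(operationName);
    }

    boolean isUpdate(String operationName) {
        return updateMethods.contains(operationName);
    }
}

/** The hook consults the registry and logs only update calls, for later replay. */
class RequestLoggingHook {
    private final MethodTypeRegistry registry;
    private final List<String> log = new ArrayList<>();

    RequestLoggingHook(MethodTypeRegistry registry) {
        this.registry = registry;
    }

    void onIncomingRequest(String operationName, String encodedArguments) {
        if (registry.isUpdate(operationName)) {          // read-only calls are not logged
            log.add(operationName + "(" + encodedArguments + ")");
        }
    }
}
```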

4.1.3 Infrastructure building blocks

Figure 4.1 summarizes the deployment of the built FT-CORBA infrastructure and application objects on hosts in a fault tolerance domain. The different boxes in the picture will be explained below.

[Figure 4.1: Deployment of the FT-CORBA infrastructure – a fault tolerance domain containing the Replication Manager and the Fault Notifier; on each host an Object Factory, a Fault Monitor and a Logging & Failover Mechanism run on top of an FT-ORB and host replica group members; the client accesses the group through its own FT-ORB.]

Our implementation of the FT-CORBA standard combines the following three ingredients:

a collection of service (CORBA) objects

CORBA portable interceptors at the client and server side

ORB source code extensions

We describe each building block in the three following sections.

Service objects

The service objects in our FT-CORBA infrastructure have their roles clearly specified in the FT-CORBA standard: the replication manager, the object factories, the fault notifier, the fault monitors and the logging and failover mechanism. The standard, however, leaves open some implementation choices. Thus:

three of the service objects are specified via interface descriptions in the standard: the replication manager, the object factory and the fault notifier. Therefore, we implemented them in a straightforward way as CORBA Objects.

The fault monitors were implemented as CORBA Objects.

The implementation of the Logging and Failover Mechanism required some consideration.

For the active replication style, the standard leaves open the choice of a mediator gateway object. In our implementation, client requests are sent to the active group via a gateway CORBA Object.

Standard service objects

As specified in the standard, in a fault tolerance domain there is one replication manager, one fault notifier, and several object factories, fault monitors, and logging and failover controllers (one on every host on which group members are to be created and run), as depicted in Figure 4.1. The object factory running on a certain host creates CORBA Object replicas, on demand, and starts them in different processes on that host. Every replica of an object group is on a different machine, and on that machine in a different process. Thus, it is possible that several processes on the same host run different replicas from different groups.

Failures of group members are detected by fault monitors running on the same hosts as the replicas. The fault monitor is implemented as a separate CORBA Object and consists of a collection of monitoring threads. Depending on the fault monitoring granularity, one thread may be created for every object to be monitored on that host. A pull based monitoring style is used. By implementing the PullMonitorable interface, application objects provide the method is_alive that is periodically called by the fault monitor. As soon as this call cannot be completed (there is no answer from the server), or the call returns a false value, the object implementing the interface is considered failed, and the failover mechanism is started. In order to ensure that the failure is real, the object is "killed": the CORBA server replica is deactivated, and the process running it is destroyed. Of course, if the object really failed, e.g. because the process running it is no longer in the system, or because the Java thread of the object failed, then the "killing" is executed in the same way. However, it might not have any effect: if the process is no longer in the system, there is nothing left to destroy.

[Figure 4.2: Illustration of the double role of the active replication gateway - the gateway broadcasts client requests to the active server group and performs duplicate suppression for outgoing calls towards a third tier server.]

Or, if the thread failed, the replica's object is no longer accessible and there is no need to deactivate it. The fault monitor reports the failure to the fault notifier. The latter sends fault notifications to its consumers, one of them being the replication manager. The replication manager is in charge of coordinating the failover process, which also includes the promotion of a backup to the role of primary in case of non-active replication.
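The pull-based monitoring loop described above can be pictured as one thread per monitored object. The sketch below is illustrative only: PullMonitorable mirrors the single is_alive operation of the FT-CORBA interface, while FaultReporter and the object identifier stand in for our fault notification path; none of these names are taken from the actual implementation.

    // Minimal sketch of one pull-based monitoring thread (assumptions noted above).
    public class PullMonitorThread extends Thread {

        public interface PullMonitorable {            // stand-in for the FT-CORBA interface
            boolean is_alive();
        }

        public interface FaultReporter {              // stand-in for the fault notifier path
            void reportFault(String objectId);
        }

        private final PullMonitorable monitored;
        private final FaultReporter notifier;
        private final String objectId;
        private final long periodMillis;

        public PullMonitorThread(PullMonitorable m, FaultReporter n,
                                 String objectId, long periodMillis) {
            this.monitored = m;
            this.notifier = n;
            this.objectId = objectId;
            this.periodMillis = periodMillis;
        }

        @Override
        public void run() {
            while (!isInterrupted()) {
                boolean alive;
                try {
                    alive = monitored.is_alive();      // periodic liveness probe
                } catch (RuntimeException e) {         // e.g. a CORBA communication failure
                    alive = false;
                }
                if (!alive) {
                    // The suspicion is turned into a real crash ("killing" the replica)
                    // elsewhere; here we only report so that failover can start.
                    notifier.reportFault(objectId);
                    return;
                }
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }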

Gateway for active replication

Instead of letting the client forward requests to all replicas of an actively replicated group, our platform uses a gateway CORBA Object. The gateway object is created at the same time as the replicas of the group, and started on a machine from the fault tolerance domain; it is located on one of the hosts where the actively replicated group members reside. As shown in Figure 4.2, the gateway plays a double role as the mediator between the client and the active server group. Firstly, the gateway broadcasts incoming client requests as calls to all the group members. Secondly, a duplicate suppression mechanism is needed when the server group, acting as a replicated client, has to execute calls to a third tier server; the gateway is involved in this mechanism as well. Outgoing calls from the members of the active server group are routed through the gateway, and the same happens to the replies to these calls.

When broadcasting client requests to the actively replicated group members, the gateway assigns unique identifiers to the calls, and the replicas must execute requests in the order given by these identifiers. If a replica receives the request with, e.g., identifier j+1 before receiving the request with identifier j, it must wait for the latter before executing j+1. As no message is lost on the links, request j will eventually arrive. The gateway object also participates in the reconfiguration of the server group in case an active replica crashes. At creation, it registers itself as a fault consumer, to receive notifications about failures of members of the respective active object group. When notified of a failure in the group, the gateway finds out the new membership of the group, so that it can direct future requests to the right replicas. When a new object joins the active group, the gateway handles the state transfer, as well as stopping incoming calls until the state is consistent in the group. Note that by using the gateway, view-synchronous multicast is implemented. Thus, a message (client request) is delivered by all correct members of a group while in the same view. All membership changes are announced at the gateway, and request broadcasting happens synchronously with the membership update messages.
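A minimal sketch of this ordering discipline at a replica is given below, assuming the gateway assigns consecutive identifiers starting from zero; the class and method names are ours and the Runnable stands in for the actual request execution.

    // Illustrative sketch of in-order execution of gateway-broadcast requests.
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class OrderedExecutor {
        private long nextExpectedId = 0;
        private final SortedMap<Long, Runnable> pending = new TreeMap<>();

        /** Called when the replica receives request 'id' from the gateway. */
        public synchronized void onRequest(long id, Runnable execution) {
            pending.put(id, execution);
            // Execute as many consecutive requests as are available; a gap means
            // the request with identifier nextExpectedId has not arrived yet, and
            // the replica must wait for it (reliable links guarantee it will come).
            while (pending.containsKey(nextExpectedId)) {
                pending.remove(nextExpectedId).run();
                nextExpectedId++;
            }
        }
    }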

Logging and failover service object

The logging and failover mechanism is used only in cold and warm passive replication (see Figure 4.3). The mechanism was implemented on each host as a separate CORBA object (called Logging and Failover Controller). This object's interface offers methods for logging method call and reply information. Other methods offered by the interface are used to retrieve this information later, at failover. This object also performs checkpointing of the primary server object. At failover, the calls for which information is in the log are those that arrived since the last checkpoint. We will come back to the logging and checkpointing operations in Section 4.1.4.

Portable interceptors for FT-CORBA

In our implementation of the FT-CORBA standard, some operations have to be performed around a CORBA request at the client side, as well as before the start or after the finish of the processing performed at the server application object. CORBA portable interceptors offer an elegant way to encapsulate and execute the needed operations at the right places. They were used to perform the following operations on a request:

adding extra information to the request: the CORBA standard recommends "carrying" this information in service contexts. One such piece of information is the group reference version known to the client. Further, one service context is dedicated to carrying unique request identification information: a client identifier, a request identifier that does not change even if the request is resent, as well as an expiration time.

[Figure 4.3: Checkpointing and logging mechanism - clients call methods m1, m2 on the primary server on Host 1; the server interceptor logs the method calls, the Logging & Failover Mechanism reads the state via get_state and records it in the state log, and the Logging & Failover Mechanisms on Hosts 2 and 3 hold the backups.]

The group reference version is used as follows: as soon as a change happens in a group (e.g., one member fails), a new version number is assigned to the group reference. The group reference obtained by the client contains a version number that may or may not be the same as the one held at the server. All this service context information is added at the client side, in a client portable interceptor.

recording information about a call in a log, as well as the reply for that call. These operations take place in a portable request interceptor at the server side.

tracing of timing information when performance measurements are needed. This operation is done both in client and in server interceptors.

For different replication styles, application server objects are equipped with different portable server interceptors. For example, in case of stateless as well as active replication, no logging of call information is needed; only replies are logged in order to detect requests that are resent. In case of cold passive replication, call information is logged only at the primary (see Figure 4.3). Besides the uses mentioned above, in the present infrastructure, operations performed in portable interceptors are needed when messages (not necessarily in the form of CORBA calls) are sent to a CORBA object. The interceptor can be used to stop such a request before it reaches the called object, which does not implement an interface containing the invoked method.
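To make the role of the server-side interceptor more concrete, the following sketch uses the standard org.omg.PortableInterceptor API, roughly corresponding to the interceptor at a passive-replication primary. LogStore and the update-method predicate are our own placeholders (a table such as the hypothetical MethodTypeRegistry sketched earlier could supply the predicate); this is a simplified stand-in, not the code of our platform.

    // Simplified sketch of a server-side portable interceptor that logs update
    // calls for replay and records replies so that resent requests are detected.
    import java.util.function.Predicate;
    import org.omg.CORBA.LocalObject;
    import org.omg.PortableInterceptor.ForwardRequest;
    import org.omg.PortableInterceptor.ServerRequestInfo;
    import org.omg.PortableInterceptor.ServerRequestInterceptor;

    public class LoggingServerInterceptor extends LocalObject
            implements ServerRequestInterceptor {

        public interface LogStore {                    // placeholder for our call/reply log
            void logCall(String operation, int requestId);
            void logReply(int requestId);
        }

        private final LogStore log;
        private final Predicate<String> isUpdateMethod;

        public LoggingServerInterceptor(LogStore log, Predicate<String> isUpdateMethod) {
            this.log = log;
            this.isUpdateMethod = isUpdateMethod;
        }

        public String name() { return "LoggingServerInterceptor"; }
        public void destroy() {}

        public void receive_request_service_contexts(ServerRequestInfo ri)
                throws ForwardRequest {}

        public void receive_request(ServerRequestInfo ri) throws ForwardRequest {
            // Only update methods are logged for replay at failover.
            if (isUpdateMethod.test(ri.operation())) {
                log.logCall(ri.operation(), ri.request_id());
            }
        }

        public void send_reply(ServerRequestInfo ri) {
            // Record the reply (or at least the fact that the request was executed),
            // so that a resent request is not re-executed.
            log.logReply(ri.request_id());
        }

        public void send_exception(ServerRequestInfo ri) throws ForwardRequest {}
        public void send_other(ServerRequestInfo ri) throws ForwardRequest {}
    }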

Extensions of ORB source code

When building our infrastructure, portable interceptors were plugged in elegantly using the existing hooks in the ORB code. This way the basic ORB implementation could be left unchanged. However, this augmentation was not enough for the infrastructure to handle all new elements introduced with the fault tolerance standard. This is probably also due to the fact that one particular CORBA implementation (OpenORB) was used. Some classical ORB code (e.g. the classes called ORB, POA, Delegate) had to be modified.

In our infrastructure, when primary-backup replication is used, the multiple profile CORBA object reference (IOGR) contains an indication about the primary member of a group. Therefore, to handle this case, the part of the ORB where the request target address is extracted from the object reference had to be modified in order to be able to choose the address of the primary. As already mentioned, the unique request identifier, consisting of the client identifier, the special request identifier and the expiration time, is added to the request information as a service context. However, in the client interceptor where this is done, it is not possible to sense when a request is a resent one. Therefore, the unique identifier information has to exist in the request already on arrival at the interceptor level, and every time the request is resent this information should be the same. For this purpose, the ORB code where the request object is created had to be modified. Also, some modifications had to be made in the client stub that is automatically generated by the IDL compiler. Therefore, the code of the IDL compiler was also extended.

Besides class extensions, some new classes had to be defined. When, for different reasons, a request must be stopped in the server or client interceptor and not sent further, there must exist special exceptions that are thrown at the level of those interceptors. An extension is also needed for handling these types of exceptions.

4.1.4 Logging and failover mechanism

In this section we highlight some design decisions in the logging and failover mechanism, in order to set the stage for the analysis in Chapters 5, 6 and 7.

Checkpointing decisions

For cold and warm passive replication, from time to time the logging and failover mechanism reads the state of the primary replica and stores it in a so-called state log (see Figure 4.3). Reading of the state is done by calling the method get_state on the primary replica, when no update method is executing on the object. This means, on the one hand, that as soon as the get_state call is made on the server object, it has to wait until all previously arrived update calls finish their execution on the object. Analogously, any update method call arriving at the server object has to delay its execution until a currently executing get_state call finishes.

In this context, some further questions arise: must the call to get_state be made periodically? If so, how often should the call be made? This is interesting because the average roundtrip time for requests is influenced by the waiting time for get_state. Also, if the state does not change very often, it might be better to save the state only after it changes. This would lead to "event triggered" checkpointing. In our experiments we used periodic checkpointing in the following way: a separate thread in the logging and failover object starts the state reading procedure "from time to time" by registering the call to get_state in the logging and failover object, then putting it to wait in the queue of the application object, and finally reading and recording the state; "from time to time" means that a fixed time (the period) after the state recording, the checkpointing thread starts the procedure again. If we take, for example, the value 1 second for the checkpointing period, then each time the state recording at the end of a checkpointing procedure completes, a new instance of the procedure is started 1 second later. In our studies we chose the values for the checkpointing interval on a purely empirical basis, with no underlying mathematical support. Ideally, we would like to provide guidelines for choosing the checkpointing period so as to optimize a certain attribute of the system: availability for processing client requests, or response time of client requests. We will come back to a more informed choice of the interval by mathematical modelling in Chapter 7.

As soon as the state is read from the primary and is ready to be stored in the log, the current record residing there is removed and replaced with the new one. In the call information log, records of update method calls that arrived before the state read request arrived at the server are deleted, as their changes are reflected in the state of the object obtained by get_state. If the state read from the primary is very large, it can be problematic to transfer it to the log, as well as to keep it there, even if only one state record is present in the log at a time. In such situations it could be helpful to read and store only state changes instead of whole state records. Here, we assume that the sum of the stored state changes is smaller than a single large state record. To be able to obtain state changes (or updates) from the object, it must implement the Updateable interface, offering the two methods get_update and set_update for reading and writing (or setting) a state change. There are two problems with this approach:

first, all state change records have to be kept in the log, in order to be able to correctly reconstruct the state at failover. So, if many state changes are recorded until the failure occurs, it is probable that the storage space required by them becomes larger than that required by one single, even large, state record.

second, if the fields of the application object are rather complex, as well as the changes to them, then it is not straightforward to express these changes as a simple array of bytes to be used when writing this state in the object. Since the code for the get_state/set_state or get_update/set_update methods is written by the application developer, it is important to impose feasible demands.

It is possible to use a combination of the two approaches: at larger intervals (say every 2 hours) a state reading is performed, thereby removing all update or call information records reflected in the new state, and at smaller intervals (say every 3 minutes), between the state readings, state change readings are performed. In our implementation we used only the periodic state reading approach. Experimental results are presented only for this variant of checkpointing.
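The periodic variant described above can be sketched as follows. Checkpointable and StateLog are placeholders for the application's get_state support and for our state log, and the interval value is only an example; this is a sketch under those assumptions, not the actual implementation.

    // Sketch of the periodic checkpointing procedure: the interval is measured
    // from the end of one state recording to the start of the next.
    public class CheckpointingThread extends Thread {

        public interface Checkpointable {          // application object offering get_state
            byte[] get_state();
        }

        public interface StateLog {                // placeholder for our state log
            void replaceState(byte[] state);       // only one state record is kept at a time
            void pruneCoveredCallRecords();        // drop update-call records covered by the state
        }

        private final Checkpointable primary;
        private final StateLog log;
        private final long intervalMillis;         // e.g. 1000 ms

        public CheckpointingThread(Checkpointable primary, StateLog log, long intervalMillis) {
            this.primary = primary;
            this.log = log;
            this.intervalMillis = intervalMillis;
        }

        @Override
        public void run() {
            try {
                while (!isInterrupted()) {
                    // get_state waits until currently executing update calls finish;
                    // queued update calls in turn wait until get_state completes.
                    byte[] state = primary.get_state();
                    log.replaceState(state);
                    log.pruneCoveredCallRecords();
                    Thread.sleep(intervalMillis);  // the period starts after the recording
                }
            } catch (InterruptedException e) {
                // shutting down
            }
        }
    }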

Logging of method call and reply information

In the primary-backup replication style, besides state information, update call information has to be logged at the application server, for later replay at failover. There is no need to log read-only method calls, because they will not be replayed, since no changes to the server state are induced by these operations. It can happen, however, that after a failure, a new primary is set up and a request involving a read-only method call is resent by the client. Thus, the read-only method will be executed again and it is possible that it returns a different answer than the first time. However, this is not a problem, because the client is interested in the latest reading result anyway.

To explain the execution model of requests on the server, we first formulate the following statements: (1) we know that, at failover, the only information available for replay is the call information records from the log; (2) the order of replay will coincide with the order of logging; (3) we consider the state of the application object to be composed of several independent portions (see, for example, the fields of the class ExampleClass in Figure 4.4) - by independent we mean that changes performed on, for example, v1 and v2 during the execution of different methods are allowed to happen in any order; (4) we do not want to make any assumptions about commutativity of changes performed on the state, e.g. that changing v1 by two methods in different orders gives the same result; (5) at middleware level, where logging of call information happens, there is no knowledge about when/where, during their execution, methods access independent portions of the state of the object.

From statements (1) and (2) we can conclude that the state changes on non-independent state portions, at replay, happen in the same order as the call records are logged. From statement (4) we conclude that, in order to obtain in the backup a state consistent with the one the primary had before it failed, the state changes on non-independent state portions at replay have to be the same as those executed on the primary while it was alive. These imply that the order of state changes executed on non-independent state portions in the primary must coincide with the order of logging call information records. Since, according to statement (5), we cannot differentiate, at middleware level, between changes performed on non-independent or independent portions of the state within methods executing on the server, we conclude that we must order all state changes (on non-independent as well as independent state portions) according to the order of call information records in the log. Now, logging is done separately from call execution on the application, and besides, we do not make assumptions about the threading policy of the ORB. Therefore, the best way to enforce the above mentioned property is to impose, from the middleware level, sequential (serialized) execution of the calls on the application. This is obtained simply by queuing requests at the server interceptor level, and letting the next request leave the queue and proceed to the application object as soon as the previous request has been executed on the application. In warm passive replication, where call information is broadcast to the backups for logging, the order of logging is enforced by assigning consecutive identifiers to the broadcast information (as in active replication). The backups record call information in the log in increasing order of the identifiers. Of course, the order of assigning identifiers coincides with the order in which the calls are executed on the primary object.

class ExampleClass {
    int v1, v2, v7;
    boolean v3, v4;
    StructureA v5;
    StructureB v6;

    void method_a(...) { ... }
    ...
}

Figure 4.4: Object state portions

Logging of call replies, or at least of a note that the request was already executed once, is done at the server side interceptor level, with the purpose of avoiding re-execution of a method call. There are two cases:

for update method calls - the reason here is to avoid changing the state of the server twice by re-executing the method that was called. Thus, even when the call does not return a value, or is even a one-way method call (the sender does not wait for an answer to the call), there has to be a note that the method was executed and changed the state accordingly.

for read-only method calls - the reason here is to gain time. On the other hand, as already mentioned, it is reasonable to assume that the client always wants the latest result of a reading, regardless of when the request was sent. Thus, if the reply to a request does not return on time, or the request is resent for some other reason, obtaining the latest result is just as good as getting the logged answer back with less extra delay. Independent of whether read-only call replies are logged or not while a replica is correct, there is definitely no need to log these replies for use by the new recovering primary.

Table 4.1 summarizes the call and reply logging points at the server side (the only place where these operations are done). To our knowledge, this is the first fault-tolerant CORBA infrastructure closely following the CORBA standard. The next section will present a CORBA middleware supporting application and infrastructure fault tolerance that does not respect the CORBA standard, but is more robust than the FT-CORBA infrastructure.


Before executing the request on the application object (at receive_request in the server interceptor): try to retrieve the call reply; wait for update and get_state methods to finish; log the method call information.

After executing the request on the application object (at send_reply in the server interceptor): notify queued update methods; log the call reply information.

Table 4.1: Logging related operations in the server interceptor
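The serialized execution and the "wait for update and get_state methods to finish" entry in Table 4.1 can be realized with a simple admission mechanism at the interceptor level. The sketch below uses a single fair semaphore permit acquired before a call reaches the application object and released after the reply is sent; the class and method names are ours and this is a simplified stand-in, not the queue implementation actually used.

    // Minimal sketch of serialized request execution enforced at the server interceptor.
    import java.util.concurrent.Semaphore;

    public class RequestSerializer {
        // One permit: at most one request executes on the application object at a time.
        // Fairness makes waiting requests proceed in FIFO (arrival) order.
        private final Semaphore permit = new Semaphore(1, true);

        /** Called in receive_request, before the call reaches the application object. */
        public void beforeExecution() throws InterruptedException {
            permit.acquire();
        }

        /** Called in send_reply/send_exception, after the call has executed. */
        public void afterExecution() {
            permit.release();
        }
    }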

4.2 The FA-CORBA infrastructure

As explained in Section 4.1.3, the CORBA specification provides no elegant way to handle failure suspicion notifications. The straightforward solution is to turn the crash suspicions into real crashes, by “killing” the suspected server object. Besides, the CORBA fault tolerance standard defines infrastructure components that, unless replicated in a possibly ad-hoc manner, constitute single points of failure.

Here we present an approach that deals with these problems. We employ an improved version of an algorithm originally developed by Dutta et al. [43]. Our improvement to the algorithm does not affect its availability aspects, but improves the performance of failover operations (defined in Section 2.2.3). In this approach, the infrastructure units handling fault detection and failover are distributed and fail together with application units. Therefore the algorithm executed by the distributed infrastructure handles failures of the whole system, and thereby assures full availability. The working of the algorithm does not depend on any particular middleware, but it is interesting to implement it in a real middleware to test its characteristics. This section presents a CORBA based implementation of the algorithm. The infrastructure obtained in this way will be called the fully available (FA) CORBA infrastructure. The notion of transparency is kept. Also, the application developer still concentrates on application functionality, and extends the code with very few elements demanded by the infrastructure developers. In this section we focus on the embedding of the algorithm in a CORBA infrastructure. Architectural elements together with implementation issues will be presented. Also, the improvement to the algorithm will be explained.

4.2.1 Architecture units

The system model used by Dutta et al. [43] incorporates a set of processes that can suffer crash failures, without recovery. Communication links do not duplicate messages, do not create messages, and any message sent on a link is eventually received by its destination, if the destination has not failed. For the algorithm to terminate, at least a majority of processes must be correct, and thus able to participate in decision making. Even though similar, the algorithm differs from active replication in two ways: servers can have non-deterministic behaviours, and there is no parallel independent processing of requests after agreeing on their total order. For the latter, an agreement protocol is run only when the result of a request is sent back to the client. This allows the possibility of having two or more server replicas attempt to execute the same request and perhaps temporarily obtain two different results. Also, the order of request processing is not agreed upon. The outcome is made unique by enforcing in every replica the updates and results obtained by the current leader, which is the node sending the result back to the client.

To achieve this, two conceptual building blocks are involved in the fully available infrastructure: (1) a consensus service used to make sure that replicas agree on the result to be sent back to the client, as well as on the state modification enforced on all replicas, and (2) a leader election service that is used to determine the replica that is supposed to process requests and reply to the client. The application replicas are the participants in the algorithm. They are themselves equipped with unreliable failure detectors (see Section 2.1.4) incorporated in a leader election unit, as well as with consensus units (see Figure 4.5(a)).

The type of failure detector used is the weakest one that allows consensus to be solved (Ω, see Section 2.1.4). The failure detector is queried from time to time, via the leader election unit, about the identity of the leader. As failure detections are imperfect, there may be two or more servers that consider themselves to be leaders at a point in time. In such a situation, the consensus helps resolve the resulting conflict. A replica that is not leader is called a witness, in the sense that it helps in the election of the unique leader and it stores the outcome of the real leader's actions in a total order to be agreed upon.

Figure 4.5(a) shows the deployment of the infrastructure units present in each server replica. Besides the leader election unit (LEU) and consensus object (CO), the actual unit performing the computations required by request processing is the local copy of the application server object (ASO) in the server interceptor (more details will follow shortly). The failure detector is set up as a separate server thread, attached to the application CORBA object.

[Figure 4.5: FA-CORBA server and client infrastructure units - (a) deployment of server infrastructure units: Application Server Object (ASO), Consensus Object (CO), Leader Election Unit (LEU) with failure detector, and Server Interceptor (SI) with ASO copy, on an FA-ORB; (b) deployment of client infrastructure units: Application Client Object (ACO), LEU with failure detector, and Client Interceptor (CI), on an FA-ORB.]

Being part of the leader election unit, the failure detector is supposed to receive I-am-alive messages from the LEUs at other application replicas. Note that we assume the leader election unit (and the failure detection unit) to be sending correct I-am-alive messages on behalf of the application replica. In other words, the fault tolerance approach properly deals with crashes of the whole node: the absence of an I-am-alive message leads to suspecting the application replica to have failed, although the message would have come from a separate unit. The leader, as determined by the LEU, has the lowest index among all replicas that are considered not failed.

Figure 4.5(b) shows the deployment of architecture units at the client side. The consensus object is missing here, but the leader election unit is present and has the same role as in the server replicas. When the client wants to send a request to the server group, it transparently queries the LEU to find out the identity of the leader from the client's point of view.
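The "lowest index among non-suspected replicas" rule can be pictured with the minimal sketch below, where suspicion is modelled as a simple timeout since the last I-am-alive message; the class is our own illustration and does not reflect the exact failure detector used in the implementation.

    // Illustrative sketch of leader determination in the LEU.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LeaderElectionUnit {
        private final int replicaCount;
        private final long suspicionTimeoutMillis;
        private final Map<Integer, Long> lastHeartbeat = new ConcurrentHashMap<>();

        public LeaderElectionUnit(int replicaCount, long suspicionTimeoutMillis) {
            this.replicaCount = replicaCount;
            this.suspicionTimeoutMillis = suspicionTimeoutMillis;
        }

        /** Called when an I-am-alive message from replica 'index' is received. */
        public void onIAmAlive(int index) {
            lastHeartbeat.put(index, System.currentTimeMillis());
        }

        /** Returns the index of the current leader from this unit's point of view. */
        public int leader() {
            long now = System.currentTimeMillis();
            for (int i = 0; i < replicaCount; i++) {
                Long seen = lastHeartbeat.get(i);
                boolean suspected = (seen == null) || (now - seen > suspicionTimeoutMillis);
                if (!suspected) {
                    return i;            // lowest index among non-suspected replicas
                }
            }
            return -1;                   // all replicas currently suspected
        }
    }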

4.2.2 Infrastructure interactions

This section describes the major interactions inside the fully available infrastructure. Figure 4.6 visualizes the flow of information between client and server group, and between server replicas, when processing a request, both when no conflicts or failures arise and when disagreement or failures intervene. The dashed line boxes containing the ASO copy suggest that the server interceptors make method calls on the application object from the interceptor level (see the arrow labelled execute() in Figure 4.6).

[Figure 4.6: Client-Server interactions in the FA-CORBA infrastructure - a leader and a witness replica (each with ASO, SI containing an ASO copy, CO, and LEU) on Hosts 1 and 2, and a client (ACO, CI, LEU) on Host x, exchanging m1()/leader()/execute()/propose() calls, READ/WRITE/Ack/Nack consensus messages, and I-am-alive messages.]

The thick dashed arrows that connect two consensus object boxes or two leader election unit boxes represent message flows over simple socket connections, not involving CORBA calls. The reason for avoiding CORBA messages where possible is to improve performance as well as bandwidth usage. The thin dashed arrows inside the client's solid host box represent the portion of the call from client to server that spans from the ORB to the client interceptor (CI). The thick solid arrows represent the flow of CORBA calls. Finally, the thin solid arrows denote the flow of method calls that are done via socket connections (see e.g. leader(), propose() or execute()).

I-am-alive messages that flow among LEUs at different replicas are used to determine who the leader is. The client, in order to enable its leader election unit to receive I-am-alive messages from LEUs at server replicas, has to register with the server group. Registration is done by sending I-am-in messages to all members of the replica group as soon as the application is started. This message carries with it the address in the client where the future I-am-alive messages have to be sent, i.e. the address (in terms of host and port) of the client's leader election unit. By concentrating on the solid arrows (thick and thin) in Figure 4.6, we now describe the working of the algorithm in a non-failure scenario.

1. arrow from ACO to FA-ORB: the client initiates the call to method m1;

2. arrow from FA-ORB to the LEU on the client's host: the client's LEU is queried about the identity of the leader;

3. arrow from the client CI to the server interceptor SI on the leader's host: the client sends the request corresponding to the call to method m1;

4. arrow from SI (ASO copy) to ASO: call to execute(), meaning tentative execution of the method by the application object. Tentative means that no durable change of the object state is made so far;

5. arrow from SI to LEU with leader(): the LEU is queried about the identity of the current leader;

6. arrow from SI to CO: if the local LEU indicates that this replica is the leader, then propose() is called with the outcome (meaning result and state change) of execute().

While performing step 6 above, the leader tries to write the outcome of its action into a unique position of the stores maintained at all witnesses plus itself (see the sequence diagram in Figure 4.7 for a better understanding of the flow of actions). It achieves its goal only if a majority of the replicas acknowledge its leadership. The dashed arrows between the CO on the leader's host and the COs on the witness hosts implement this by a series of READ, WRITE and Ack/Nack messages (see also Figure 4.7). Note that the eventual leadership property of the leader election algorithm is translated into the usage of a timer at the client side that establishes a time after which the request is resent if no reply is received. Thus, each time before executing steps 1, 2, 3, the client sets a timeout in its timer. If the reply has not arrived when the timeout expires, the request is resent.
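This client-side behaviour (query the LEU, send, resend with the same identifier on timeout) is sketched below. RequestSender and Reply are placeholders for the ORB-level machinery, and the leader query could be backed by something like the illustrative LeaderElectionUnit sketched earlier; none of these names come from the real implementation.

    // Sketch of the client-side retry loop with a transparent resend on timeout.
    import java.util.function.IntSupplier;

    public class RetryingClient {

        public interface RequestSender {
            /** Sends to replica 'leaderIndex'; returns null if no reply arrives in time. */
            Reply sendAndWait(int leaderIndex, long requestId, byte[] payload, long timeoutMillis);
        }

        public interface Reply {}

        private final IntSupplier leaderQuery;   // e.g. backed by the client's LEU
        private final RequestSender sender;
        private final long timeoutMillis;

        public RetryingClient(IntSupplier leaderQuery, RequestSender sender, long timeoutMillis) {
            this.leaderQuery = leaderQuery;
            this.sender = sender;
            this.timeoutMillis = timeoutMillis;
        }

        public Reply invoke(long requestId, byte[] payload) {
            while (true) {
                int leader = leaderQuery.getAsInt();   // ask the LEU who currently leads
                Reply reply = sender.sendAndWait(leader, requestId, payload, timeoutMillis);
                if (reply != null) {
                    return reply;                      // reply arrived before the timeout
                }
                // Timeout expired: resend the same request (same identifier), possibly
                // towards a new leader if the LEU's view has changed in the meantime.
            }
        }
    }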

4.2.3 FA-CORBA implementation

This section provides some details about the algorithm's implementation on top of CORBA. Note that the ORB that we extended was OpenORB, as for the FT-CORBA infrastructure.

[Figure 4.7: Client-Server message sequence chart when no failures occur - the client sends a request to the leader, which executes it, runs a READ/WRITE round (propose()) acknowledged by the witnesses, and then sends the reply.]

We used portable interceptors for executing the algorithm and extended the ORB source code.

Portable interceptors

The main CORBA specific elements used to implement the algorithm and the FA-CORBA infrastructure were portable interceptors. Client side portable interceptors are employed to augment the client request with a service context containing the unique request identifying information. Server side portable interceptors are mainly used to intercept I-am-in registration messages from clients, as well as normal client requests. Once a client request is received, all operations prescribed by the algorithm are performed at the level of the interceptor. Thus:

When an I-am-in message from the client (sent via dynamic CORBA invocation) arrives in the interceptor, the call is stopped from reaching the application object. For this, an exception is generated and further handled in a well-defined part of the ORB code. Before the exception is generated, however, the leader election unit of the intercepting replica is informed about the address where I-am-alive messages have to be sent later.

As soon as a client request for an application method call arrives in the interceptor, it has to wait until the previous request finishes execution on the application object (note that this waiting is an inherent feature of the algorithm and is not implementation related; it does not make sense to agree on the output of one operation on the server when the server may execute several operations in a multithreaded manner). After the waiting interval, the algorithm steps 4, 5, and 6 mentioned in Section 4.2.2 are executed. The tentative execution of the requested method call (step 4 of the algorithm) on the application is simulated in three steps: (a) reading the replica state by using the method get_state; (b) executing the needed computations on the application replica, so that the result, if any, is returned and the state of the replica is changed; and (c) reading the state change from the replica by using get_update. After this execution, if the replica in question is not a leader, or if it is locally seen as leader but its proposed outcome is not the decided one, the state read before the execution is re-instated on the replica (a sketch of this tentative execution is given after the note below).

Note that if a request is made for a read-only method call, the server replica does not execute the consensus algorithm. After executing the reading operation, the replica sends the reply to the client and stops, returning to the point where it waits for a new request to arrive. Here, no leader election unit querying takes place. It is also important to note that the only published references (addresses) are those of the application server replicas. Therefore, the I-am-in messages are sent via CORBA dynamic request invocations to these addresses. I-am-alive and other infrastructure related messages are sent using simple reliable socket connections, not involving CORBA specific overhead.
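The three-step tentative execution with rollback can be sketched as follows; the Replica interface is a placeholder for the application object's get_state/set_state/get_update support, and the class is illustrative rather than the code of our platform.

    // Sketch of tentative execution of an update call (step 4) with rollback.
    public class TentativeExecutor {

        public interface Replica {
            byte[] get_state();
            void set_state(byte[] state);
            byte[] execute(String operation, byte[] args);  // returns the result, mutates state
            byte[] get_update();                            // state change produced by the call
        }

        public static final class Tentative {
            public final byte[] preState;     // state before the call, kept for rollback
            public final byte[] result;
            public final byte[] stateChange;

            public Tentative(byte[] preState, byte[] result, byte[] stateChange) {
                this.preState = preState;
                this.result = result;
                this.stateChange = stateChange;
            }
        }

        /** (a) read the state, (b) execute the call, (c) read the state change. */
        public Tentative execute(Replica replica, String operation, byte[] args) {
            byte[] before = replica.get_state();
            byte[] result = replica.execute(operation, args);
            byte[] change = replica.get_update();
            return new Tentative(before, result, change);
        }

        /** Called if this replica is not the leader, or its proposal was not decided. */
        public void rollback(Replica replica, Tentative t) {
            replica.set_state(t.preState);    // re-instate the state read before execution
        }
    }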

Modifications of ORB source code

As in the FT-CORBA infrastructure, some ORB source code modifications were made. The good part was that some extensions required by the FT-CORBA implementation could be reused. However, some new modifications had to be devised. For example, the code of the client ORB that parses the object reference and decodes the address where the request has to be sent (the target address) had to be extended. In a normal CORBA setting, the client has a fixed address of the server in the form of an IOR (Interoperable Object Reference), and the target address is chosen after extracting port and host information from the IOR. To cope with the need to query the LEU at the client, the modification consisted of introducing the execution of the leader election algorithm as the part returning the target server address.

Furthermore, the client ORB code had to be modified to handle timeouts in a transparent way. In a classical ORB, there is a code portion containing a routine that handles the case when a reply does not return before a certain timeout. However, this only generates an exception that is supposed to be handled at the client application level. To avoid the generation of the exception, we decided to modify the code so that the request is resent in a transparent way, as envisioned by the description of the client side behaviour of the algorithm [43].

State transfer for better performance

To clarify our improvement of the original algorithm of Dutta et al. [43], let us detail the activities performed by a witness, including the moment when it becomes a leader, in the original algorithm (see also the sequence diagram in Figure 4.8 for a clearer picture). The interested reader may also refer to a more detailed description of the algorithm elsewhere [147]. During normal processing periods, that is, when no replica fails or is wrongly suspected to have failed, the leader transfers the update method execution result and the state change to the witnesses as soon as a request is processed. All state changes are stored at the witnesses, possibly in a structure residing in memory. With no further intervention (as in the published algorithm [43]), the storage can fill up with state change structures. Therefore, one limitation of the algorithm was the possibly unbounded growth of memory needs.

The second limitation is related to the moment when a witness becomes a leader. This can happen either because the previous leader really failed, or because the leader suddenly became very slow and the client as well as the other replicas no longer recognize its leadership. After this, when the "former-witness-now-leader" replica receives the first request from a client, it has to bring its state to the value found in the previous leader at the time of its abdication. Before our improvement, this state was built in the new leader by incrementally applying, to the initial replica state, the series of state changes stored at the other replicas. Obviously, the time for this procedure grows in an unbounded manner the further we get from the start-up time.

The original paper already proposed a way to solve the problem in the future: when a witness receives a state change record from the leader replica, it immediately applies it to its own local copy of the application object, instead of only storing it.

[Figure 4.8: Client-Server message sequence chart when the leader becomes too slow or fails - after a timeout the leader's propose() round does not complete, a witness becomes the new leader (and the old leader becomes a witness), completes the READ/WRITE round, and sends the reply to the client.]

Thus, storage needs are reduced almost to zero. Also, the new leader does not need to reconstruct its state by using information from the witnesses anymore, since it already has the current state itself.

Our platform implements a different improvement. In this setting, from time to time, the leader sends its entire state to the witnesses. When receiving it, the witnesses prune their storages, that is, they replace the set of existing state change records with only one state record. Thus, the storage is dimensioned so that it accommodates a limited (quite small) number of records. Consequently, the failover process in the new leader will take a limited and relatively short time to execute. It is possible for the leader to trigger the pruning periodically (say every 5 or 10 s). Another alternative is to send the state after executing a certain number of requests: say after every 20 requests executed, the leader sends its current state to the witnesses. The second alternative offers the possibility to dimension the storages to accommodate a known number of records. In the present implementation, the periodic alternative was used, since it offered a direct way to compare with the passive replication case in the FT-CORBA implementation, where checkpointing of state was done periodically.
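The pruning step at a witness can be sketched as follows. The class and method names are ours, and the byte-array representation of states and state changes is an assumption made only for illustration.

    // Sketch of witness-side storage with pruning on receipt of a full state record.
    import java.util.ArrayList;
    import java.util.List;

    public class WitnessStore {
        private byte[] baseState;                                // last full state received
        private final List<byte[]> stateChanges = new ArrayList<>();

        /** Called for every decided update outcome forwarded by the leader. */
        public synchronized void appendStateChange(byte[] change) {
            stateChanges.add(change);
        }

        /** Called when the leader sends its entire state (e.g. every 5-10 s). */
        public synchronized void prune(byte[] fullState) {
            baseState = fullState;
            stateChanges.clear();        // all recorded changes are covered by fullState
        }

        /** Used at failover by the new leader to rebuild its state in bounded time. */
        public synchronized List<byte[]> recoveryRecords() {
            List<byte[]> records = new ArrayList<>();
            if (baseState != null) {
                records.add(baseState);
            }
            records.addAll(stateChanges);
            return records;
        }
    }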

4.3 Related work

Prior to the specification of the FT-CORBA extension, a few works had studied alternative augmentations of CORBA with a process (object) group module. These works, however, contributed to devising the FT-CORBA standard. Performance assessments of these platforms were done in the context of artificial applications, and little trade-off analysis was conducted.

Felber et al. show some possible approaches for introducing object groups in CORBA [49]. Three different approaches are presented, depending on the position of the group communication module relative to the ORB (Object Request Broker): the integration approach, the interception approach and the service approach. If the integration approach is used, the group toolkit is integrated in the ORB. Thus, each request to the ORB directed towards a process group is sent as a multicast message by the group toolkit. If the interception approach is used, the ORB is totally detached from the group toolkit; there is an extra layer between the group toolkit and the ORB - the interception layer. The "object service" approach is the most CORBA-oriented. The Object Group Service (OGS) is specified as a set of IDL (Interface Description Language) interfaces, like any other CORBA service. So, it does not depend on the implementation language and it can be used with any compliant CORBA implementation. The service is composed of several distributed objects, avoiding single points of failure. Failure detection of remote components is provided, as well as distributed agreement protocols. With the OGS, clients are not aware of the fact that they are invoking methods on a non-singleton object. Thus, transparency of invocations is ensured. The OGS is constituted from a set of CORBA services specified separately and interacting through the ORB, defined as components. These are:

a Messaging Service that provides non-blocking reliable point-to-point and multicast communication

a Monitoring Service that provides application object failure detection mechanisms

a Consensus Service that allows group members to solve the distributed consensus problem


a Group Service that solves group multicast and group membership problems, using the Consensus Service

The OGS can be used in two configurations: a linkable library model or a daemon model. The first model brings application objects and service objects into the same address space. In the second model, service objects are located in a different process on the same host or on a remote host. In the first case, both application objects and service objects have to be written in the same programming language. Measurements of the OGS performance were done with an artificial application where the client sends 100 synchronous invocations in one round. The results were averaged in every round and the best result was kept. Several experiments were done using the linked model or the daemon model, with group sizes varying from one to ten. Also, total order multicast was used, as well as reliable and unreliable multicasts. The observed results showed that the latency of total order and reliable multicast grows fast with the size of the group.

Narasimhan et al. [110] propose an interceptor approach to enhance CORBA with fault tolerance. The result of the research work in this direction is the Eternal System. Interceptors can be used for several other purposes as well: real-time scheduling, profiling and monitoring, as protocol adaptors for protocols other than IIOP, and for enhancement of security. By using Eternal, fault tolerance can be added to an application built over a CORBA implementation, without modifying the ORB or the application. This fault-tolerant CORBA infrastructure does not entirely conform to the CORBA standard. There are two interception mechanisms used by Narasimhan et al.: system call interception and library routine interception. In the case of the first mechanism, interceptors are placed at the operating system level, to modify the actions of certain system calls. The UNIX /proc file system is used, where images of processes can be found. The files can be accessed via specific operations and the information in them modified, thereby indicating what system calls to intercept and how to act for them. In the case of library routines, interception is based on the possibility to add shared objects to a process when it initializes. When a system has to be enhanced to provide fault tolerance mechanisms, the interceptors are used in several directions:

for reliable multicast - as protocol adapter - an IIOP message is parsed at the operating system level, and then it is passed to a reliable multicast protocol like Totem to be further sent to the group replicas;

for consistent multi-threading - as scheduler - in the case of multi-threaded objects or processes with multiple single-threaded objects sharing data, the thread dispatching is done so that mutual exclusion is achieved when needed;

for replication management - duplicate message suppression, logging of messages, state transfer to new recovering replicas.

Replica state consistency related problems are pointed out in the context of Eternal [111]. Several components of the replica state are delimited: application level state (read from time to time by using the method calls specified in the FT-CORBA standard), ORB/POA (Portable Object Adapter) level state, and infrastructure level state, specific to Eternal. The ORB/POA level state has two parts: GIOP (General Inter-ORB Protocol) request identifiers (for outgoing server messages) and client-server handshake messages. These data have to be transferred to new or recovering replicas, in order for them to be able to resend a message (if necessary) with the same identifier as before. This cannot happen if request identifiers in the recovered server start from a default initial value. The third component of the replica state is stored by the Recovery Mechanism of Eternal and is made up of information regarding requests processed by the server, invocations sent out by the server for which it awaits replies, the replication style of the replica group, etc. In the works with the Eternal system, the overheads given by the interception, multicast and replica consistency mechanisms were measured. As a summary of experiences with FT-CORBA, Felber and Narasimhan, two of the main contributors to the standard, published a journal article in 2004 [50].

Chung et al. present a fault-tolerant infrastructure (DOORS) built using the service approach, on top of CORBA [28]. In this setup, application objects register with DOORS in order to be made fault-tolerant. Fault tolerance services are realized with two components: ReplicaManager and WatchDog. Using DOORS as an FT-CORBA implementation, Natarajan et al. [112] present ways of improving the performance of an FT-CORBA infrastructure. The focus of the authors is on the Replication Manager and the Fault Detectors. Different design, architectural, and optimization patterns are applied in order to reduce the time to detect faults and, further, failover times. Experiments were performed using the warm passive replication style, always creating a new backup in order to maintain the size of the replica group. The timing values obtained from the measurements show the need for optimization of the system.

Mishra et al. [104] describe a CORBA group communication service based on a service approach. The design and two implementations are presented.

A UDP (User Datagram Protocol) socket based implementation is chosen for comparison purposes. The CORBA version is preferred because of the possibility to have heterogeneous distributed systems, in terms of application parts written in different programming languages or machines with different operating systems. The evaluation of the two implementations is directed towards measuring the throughput of updates broadcast by group members, the delivery time, the stability time (the time it takes for an update to become stable, i.e. to be received by all replicas), and the number of messages per broadcast. Results clearly show the superiority in performance of the UDP based implementation. Timing figures differ by a factor of approximately ten, i.e. stability and delivery times are around ten times longer in the CORBA implementation, while throughput is ten times smaller. The number of messages per update is around twice as high in the CORBA implementation. The conclusion of the work is that CORBA is suitable for building such a group communication service in systems that do not require high performance.

A framework for fault-tolerant CORBA services with the use of aspect oriented programming is developed by Polze et al. [122]. The goal is to provide the application writer with the possibility to build a fault-tolerant application by choosing the types of faults to be tolerated. The toolkit then chooses the appropriate fault-tolerance strategy. Communication between a client and a replicated service is done via an interface object. This object, a sort of gateway, may be equipped with evaluators to be able to detect possible computation faults in the replicas. Due to it being a single point of failure, the interface object poses interesting problems that are, however, left to the ORB vendors. Non-functional aspects of a system are described and used as starting points when building the framework for creating fault-tolerant services. Some of these aspects are: the fault tolerance aspect that describes the fault tolerance techniques used by a component, the timing aspect describing the timing behaviour of a component (average and worst-case execution time), or the consensus aspect implemented by application voter objects.

Killijian and Fabre [81] describe a framework for building fault-tolerant CORBA applications by using reflection. A meta-object protocol is defined, and an open compiler is used, to be able to extend CORBA objects with wrapper methods. These wrappers help the state capturing mechanism by always saving the state when the wrapped modifier method is called. By this, of course, the application writer is relieved of the effort of writing get_state or get_update methods.

Zhang and Zhang [167] study trade-offs related to replica consistency for performance and availability.

The approach is not to statically fix the level of consistency at one of the extremes, strong consistency or eventual consistency. The aim is to adapt the level of consistency depending on the client tolerating a certain obsoleteness of data. Thus, consistency can be tuned to provide higher availability and better performance, if applications are satisfied with obtaining up-to-date data some time after it was modified. The analysis and adaptation algorithm is based on enlarging or shrinking the so-called update window. This is a measure of how many updates a replica can buffer before it sends them to all replicas. If the window is enlarged, then the other replicas might have outdated results for a longer time than when the window is small (e.g. having value 1).

After the FT-CORBA specification was made part of the CORBA standard, still not many works evaluated trade-offs and the feasibility of implementing the standard.

Friedman and Hadad [57] present a fault tolerance supporting CORBA infrastructure built using a combination of portable interceptors and the service approach. The infrastructure does not totally conform to the FT-CORBA standard, as other design goals were more important than conformance (e.g. one goal was to provide partition failure handling, which is not provided in the FT-CORBA standard). A special portable adapter, the Group Object Adaptor, is used. The work only provides support for active replication of servers. Communication and node crash failures are handled. An underlying group communication subsystem is employed to provide consistency. No single points of failure are present. Transparency to clients is provided only partly, since clients obtain a handle to one of the replicas and communicate only with it. This one replica is responsible for diffusing the result of its operations to the other replicas. The infrastructure is portable and interoperable. No performance measurements are made on applications running on top of the platform, and no trade-off analysis is provided. Only some remarks are made about how good performance is achieved by using active replication and clients connecting to only one server.

Yue et al. [165] use CORBA portable interceptors to obtain a fault tolerance supporting infrastructure by linking an ORB platform (ORBUS) with a non-CORBA fault-tolerant framework, EDEN. Only active replication is used, where it is assumed that applications are deterministic. EDEN has the role of forwarding CORBA messages from a client to all replicas of a server, in the same order. It uses a sort of agreement algorithm for deciding the identifier of the request to be executed. Portable interceptors are used to add extra information to a request, such as group identifier and request identifier, or to extract this information to be able to execute requests in the right order. The work does not provide performance or trade-off studies.

Narasimhan [109] surveys trade-offs between real-time and fault tolerance in middleware applications. The starting point is that CORBA standardized both types of support (RT-CORBA and FT-CORBA). However, it is not straightforward to implement and use a CORBA implementation with both features, because their requirements sometimes conflict.
For example, it is hard to predict the occurrence of a fault and, further, the time needed for a server to execute a request and reply to a client when the system recovers from failures. It cannot be predicted how soon the failure will be detected, nor, for example, what the state of the server will be at the time of the failure. Such properties are not acceptable in a real-time system where predictability and boundedness of the time spent to execute an operation are crucial. Therefore, using FT-CORBA in a real-time system with strong predictability requirements, as indicated by the RT-CORBA standard, becomes very difficult.

As a consequence of the above observations, Narasimhan et al. [20, 135] developed a middleware for real-time fault-tolerant applications called MEAD. In this infrastructure, attention is paid to, for example, obtaining predictable failover times, developing a tool to assist the decision of assigning fault tolerance properties, and even providing run-time feedback about resource availability and usage. Fault tolerance features such as state consistency can be configured to obtain better and more predictable timing values. Based on resource availability (for example, due to a failure, some resources can be temporarily unavailable), the system can negotiate new quality of service values with clients, if possible. The resource manager is implemented in a decentralized way, such that on every node there is one such module, to avoid single points of failure.

Final remark

The work presented in this chapter is, to our knowledge, the first to deal with building a fault-tolerant CORBA infrastructure closely following the CORBA standard. Furthermore, it is a work that contrasts the FT-CORBA implementation with an implementation not conforming to the standard. The next chapter will present trade-off analysis in the two infrastructures. A realistic telecom application is run on top of the middleware platforms for performing measurements.

Chapter 5

Empirical studies with a telecom application

This chapter describes our findings when running a realistic telecom application on top of each of the middleware platforms with support for fault tolerance presented in Chapter 4. The quantitative results presented are in terms of roundtrip time overheads and failover time. Besides experimental results, we will also bring to light some of our reflections about the feasibility of implementing FT-CORBA, or the trade-off for achieving full availability. The obtained numerical results are significant mainly in a comparative context, and increase the knowledge about trade-offs in both platforms. We do not claim by any means that experimenting with only one application via a quite limited number of experiments can lead to immediate generalization. However, the study gives a flavour of what problems appear in these contexts.

5.1 Experiments with the telecom application

This section describes the application we used in our experiments, as well as the experimental setup: the computation resources used, the replication parameters, and the client request patterns. Finally, the performance metrics are briefly presented.

5.1.1 The telecom application

Earlier work in the area [49, 111] has presented experimental results obtained with artificial applications and test cases. These are easy to set up; however, they do not always provide results that can be compared to practical deployment scenarios. Their advantage is that infrastructure properties are quite easy to separate from application-induced ones. To shed light on more practical settings, a realistic telecom application was used in our experiments: a generic service in the operations and management (O&M) part of radio networks, together with artificial client-side test cases, provided by Ericsson Radio AB.

The application is a so-called Activity Manager server. Its functionality consists of creating jobs made up of activities that have to be scheduled immediately or at later times. Also, after creating an activity, the server receives activity status and progress reports. The state of the server is made up of internal structures storing the collection of jobs and activities. These are managed according to what the clients (other parts of the O&M network) ask for. The server's functionality is exposed through fifteen methods for job and activity creation, as well as activity and job start and stop operations. The size of the server application is around 10,000 lines of Java code. Our industrial partner chose this service as it has generic characteristics representative of other applications as well.

5.1.2 Experiment setup

We performed our experiments on SUN Ultra SPARC workstations running SunOS 5.8, with no isolation from other tasks running on the hosts. The machines were connected via an IP network whose links were not used exclusively by the communication of our infrastructure and application. The reason for not performing the experiments in a controlled environment was to mimic the realistic setting in which the service will eventually run. For both platforms, the clients were unchanged in all our experiments. The comparison baseline when devising our results was a non-replicated server with no FT support included in the ORB. Note that in this way neutrality to the different replication styles is preserved, as opposed to comparing with a setting using application level replication, which could resemble one of the styles.

There were two goals for performing experiments: to measure roundtrip time overheads and to measure failover times. Roundtrip time overhead is the extra time spent by a request on its way client-server-client due to the different fault tolerance handling related operations; the roundtrip time (see figures 5.1(a) and 5.1(b)) is exactly the time spent on the "round trip" client-server-client; the overhead for a certain request is obtained by subtracting the value of the roundtrip time in the non-replicated setting from the value of the roundtrip time when sending the same type of request in the fault tolerance enhanced middleware. Measurement probes were placed in client and server interceptor units to record important time points in the trip of a request from client to server and back. To avoid probe effects, null probes were inserted in the non-replicated case in the same position in each interceptor unit.

For the measurements of failover times, the baseline was chosen as the time taken from the crash of a (non-replicated) server to its restart (possibly manual, as is the case in the real setting). In the different platforms, the failover time is affected by the replication style and is computed as follows, for the FT-CORBA and FA-CORBA platforms respectively (a compact formalization is given after the list):

In cold passive replication, it is the time taken to set the state on the new primary, plus the time needed to replay the requests arrived since the last checkpoint.

In warm passive replication, it is the time spent in replaying requests.

In active replication it is the time spent to reconfigure the server group.

For the FA-CORBA platform, to measure failover times, the current leader is forced to fail after a certain number of processed requests. Then, a new leader is elected and forced to bring its state up-to-date as soon as the next query arrives.
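The failover-time components just listed can be summarized as follows; the notation is ours, not the standard's: L denotes the set of method calls logged (respectively updates stored) since the last checkpoint or pruning, t_state the time to set the checkpointed state on the new primary, t_replay(m) the time to replay or apply one entry, and t_reconf the group reconfiguration time.

\[
t_{\mathrm{failover}}^{\mathrm{cold}} \;=\; t_{\mathrm{state}} + \sum_{m \in L} t_{\mathrm{replay}}(m),
\qquad
t_{\mathrm{failover}}^{\mathrm{warm}} \;=\; \sum_{m \in L} t_{\mathrm{replay}}(m),
\qquad
t_{\mathrm{failover}}^{\mathrm{active}} \;=\; t_{\mathrm{reconf}} .
\]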

To obtain relevant results, the number of requests processed before the failure was chosen to be the same for both warm and cold passive replication in the FT-CORBA platform and for the experiments with the FA-CORBA platform.

For measuring overhead, the clients used in the tests were simple: they called six of the update methods of the server in a loop. The code of some of the methods contained calls to third-tier server methods. This way the duplicate suppression mechanism of the gateway, used in active replication, was tested as well. The clients called the methods in loops of 100 and 200 iterations. Some of the methods were called once per iteration (Methods 1, 2, 3, and 5), others two (Method 6) and four (Method 4) times, respectively. The results are presented as averages of the measurements for every method call, computed over runs with both types of loops (100 and 200 iterations); such a set of runs is hereafter called one experiment. The averaging was done in order to even out the different network and processor loads at the time of the experiments.

In our experiments we varied two parameters: the replica group size and the checkpointing/pruning interval (where applicable). We expected mainly these two attributes to influence the results. The group size was expected to influence the roundtrip time overheads in the following cases:

in warm passive replication - due to the broadcasting of state to the backups;

in active replication - due to broadcasting the request to all replicas;

in the FA-CORBA infrastructure - due to the execution of the consensus algorithm, which involves a decision from a majority of group members.

The checkpointing/pruning interval was expected to influence both roundtrip time overheads (due to interference of client request processing with checkpointing, respectively pruning, operations) and failover times (due to the number of requests to replay at failover: most probably high if checkpointing/pruning is done seldom). In both platforms group sizes of 2, 3, 4 and 8 replicas were used. In the FA-CORBA platform, the group size was of special relevance, since it reflects different majority rules. In the FT-CORBA platform three different checkpointing intervals were used: 1, 5 and 10 seconds. In the FA-CORBA platform, the pruning interval was chosen in a similar way to make valid comparisons. Note that the values for the group size were chosen especially guided by their significance in the FA-CORBA platform, where different group sizes lead to different majority sizes. All in all, twelve experiments were performed for each of the following styles: warm passive, cold passive, and fully available replicas (corresponding to 4 replica configurations and 3 checkpointing intervals). For active replication, four experiments were performed (corresponding to the 4 replica configurations).

5.1.3 Measuring roundtrip time overheads

The slices of the roundtrip time were used to identify the dominant parts of the overhead, i.e. where the largest contribution appears. As shown in Figure 5.1(a), the slices that appear in a non-replicated setting, as well as in replicated scenarios where they are larger, can be described as follows: the time spent in the client interceptor when adding information to the request (t1), the time spent traveling from the client side node to the server side node and until the request is taken up for processing (t2), and the time spent in the server side interceptor when a request arrives at the server (t3) and when it returns to the client as a reply (t5).

Figure 5.1: Roundtrip time overheads illustration. (a) Roundtrip time slices t1-t6, recorded at the client and server interceptor probe points (ENTER/EXIT_SEND_REQUEST, ENTER/EXIT_RECEIVE_REQUEST, ENTER/EXIT_SEND_REPLY, EXIT_RECEIVE_REPLY). (b) Roundtrip time slices and overhead: in a replicated setting each slice ti is inflated by δi, so that Overhead = δ1 + δ2 + δ3 + δ4 + δ5 + δ6.

Figure 5.1(b) shows explicitly the sources of overhead in a replicated setting. For both platforms the server-side time includes the time spent waiting for all earlier arrived update methods to finish execution on the server. In the FT-CORBA platform, the time for logging the method call information is included, as well as the time spent recording reply information. The approximate computation time is counted from the moment of leaving the interceptor at receiving the request until it is entered again when sending the reply to the client. In the FA-CORBA platform the situation is different: the time spent in the server interceptor includes the method execution time, as well as the consensus execution time and the state reading and updating times. The time taken by the reply to travel back from the server to the client node is traced in the same way in both platforms (t6).
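In the notation of Figure 5.1(b), and writing RTT for the complete roundtrip time, the per-request overhead defined in Section 5.1.2 is simply the sum of the inflations of the individual slices (the superscripts are our own shorthand):

\[
\mathrm{overhead} \;=\; \sum_{i=1}^{6} \delta_i \;=\; \mathrm{RTT}^{\mathrm{replicated}} - \mathrm{RTT}^{\mathrm{non\text{-}replicated}} .
\]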

5.2 Results

This section presents experimental results for both the FT-CORBA and the FA-CORBA platforms. First, overheads are presented, followed by the time taken for failover. Note that the numerical values provided are not meant as absolute profiling information. They merely provide high-level views on the orders of magnitude of timing values, as well as comparative characterizations of the different replication styles within the same platform or in different platforms.

5.2.1 Overheads

Table 5.1 summarizes the overheads in the FT-CORBA platform (columns three to five) and the FA-CORBA platform (column six). Each row of the table corresponds to one of the six method calls. Column two contains the average roundtrip times for the different method calls when the client was calling them on the non-replicated server. The results in the next columns are presented as average percentages. Each average is computed as follows. Consider the cell for Method 3 in the cold passive column of Table 5.1, containing the value range 62%-163%. Here, the average roundtrip time value for Method 3 on the non-replicated server was 65 ms. The term 62%, for example, indicates that the average roundtrip time in the replicated experiments was 65 + 0.62 × 65 ≈ 105 ms (the general formula is given after the table); this was the smallest average roundtrip time measured within the twelve experiments. The cell thus shows that the average overhead when calling Method 3 on the server, while using the cold passive replication style, ranged between 62% and 163% for group sizes of two, three, four, or eight replicas and checkpointing intervals of 1s, 5s, or 10s. In particular, 62% corresponds to a group size of three and a checkpointing interval of 10s, while 163% corresponds to a group size of eight and a checkpointing interval of 1s. When we further examined the range of values for each experiment set, we noted that the group size slightly influenced the overheads in warm passive replication.

Method     Non-repl. RTT   Cold passive   Warm passive   Active         Fully available
Method 1   130 ms          55%-117%       53%-134%       1745%-5000%    141%-472%
Method 2   61 ms           77%-221%       110%-285%      770%-3360%     265%-950%
Method 3   65 ms           62%-163%       74%-240%       270%-360%      213%-861%
Method 4   80 ms           76%-447%       100%-550%      1790%-5100%    308%-1020%
Method 5   133 ms          44%-333%       59%-363%       905%-4300%     223%-726%
Method 6   106 ms          37%-419%       68%-383%       800%-3100%     280%-844%

Table 5.1: Summary of overhead percentages in the FT-CORBA and FA-CORBA platforms
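The percentages in Table 5.1 follow the convention illustrated by the worked example above: an overhead percentage is the relative increase of the average roundtrip time over the non-replicated baseline, i.e. (in our notation)

\[
\mathrm{overhead\%} \;=\; \frac{\overline{\mathrm{RTT}}_{\mathrm{replicated}} - \overline{\mathrm{RTT}}_{\mathrm{non\text{-}replicated}}}{\overline{\mathrm{RTT}}_{\mathrm{non\text{-}replicated}}} \times 100 ,
\]

so that, for Method 3 under cold passive replication, 62% corresponds to an average replicated roundtrip time of 65 + 0.62 × 65 ≈ 105 ms.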

In cold passive replication, the group size had almost no effect on the overheads. The variation in overhead for this replication style is mainly given by the different checkpointing intervals. For lower values of the checkpointing interval, the overhead was slightly higher, due to a higher degree of interference between method calls and state checkpointing. Another interesting observation is that the overhead percentages do not differ much between cold and warm passive replication. For active replication, on the other hand, the overheads are large, and significantly affected by the group size. The large variations present in Table 5.1 are due to the variation over the number of replicas: the lowest value in each range corresponds to a group size of two replicas, while the largest corresponds to a group size of eight replicas. The FA-CORBA overheads were influenced by the group size, but also by the chosen pruning interval. Both parameters influence the average time taken for execution of the consensus primitive, which is part of the overhead, in the following ways:

The group size gives the number of nodes to which messages have to be sent, and from which answers must be received (their majority), as part of the consensus execution.

The pruning interval affects the time spent in the execution of consensus, due to the pruning action on the registers. Thus, the shorter the pruning interval, the more often the time spent in consensus is larger than in a situation without pruning. Therefore the average roundtrip time grows, and so does the overhead.

Figure 5.2 gives a bird's eye view of the overhead percentages for the different methods. We have chosen to show the figures for Method 2, which are representative of the other methods too. The chart represents the lower end (min %) of the overhead ranges described in Table 5.1, as well as the higher end (max %). It shows that

Figure 5.2: Bird's eye view of minimum and maximum overhead percentages (average overhead percentage, 0-3500%, at the min % and max % ends of the ranges, for cold passive, warm passive, fully available and active replication)

the overheads in the FT-CORBA platform for passive replication are lower than those in the FA platform;

the overheads for active replication are much higher than the overheads in the FA platform;

the overheads for the cold passive replication style are slightly lower than those for warm passive.

The slices of the roundtrip times for the replicated scenarios can be further investigated to find out where the overhead comes from. It turns out that

in the FT-CORBA passive styles, as well as in the FA-CORBA platform, the main part of the overhead comes from the wait time incurred when executing methods in a serialized way. This wait time is the sum of the times spent by all requests that arrived before the one currently waiting, where the time for each request is counted from its arrival in the server interceptor until the result of its processing is sent back to the client. In the FA-CORBA platform this wait time, an inherent feature of the algorithm, is even larger, because these individual times are increased by the execution of the consensus primitive (see Figure 5.3 for the values of the consensus time);

for active replication, the average overhead values are generally large. Since the larger overheads appear only for method calls that themselves made outgoing calls, it was deduced that this was caused by the gateway's duplicate suppression mechanism. Method 3 did not call other methods, and thus experienced a lower overhead.

Figure 5.3: Time spent in consensus for group sizes of 2, 3, 4 and 8 replicas (average consensus time in ms, 0-160 ms, shown for Method 1 through Method 6)

5.2.2 Failover times

As mentioned earlier, support by the middleware requires the application writer to write little extra code, so in that sense it is directly comparable to a non-replicated scenario. In this section we quantify the benefit to the application writer in terms of shorter failover times (compared to manual restarts). We summarize the failover time performance in both platforms and relate it to the parameters influencing it.

In the FT-CORBA platform there is a significant difference between failover times using passive or active replication. In active replication, the failover time was extremely short: 70 to 75 ms. In passive replication, when using the warm style, the faster failover was due to the fact that no state transfer was needed at the moment of the reconfiguration, as opposed to the cold style. In both passive styles, the failover time depended on the number of method calls logged since the last checkpoint that had to be replayed on the backup. This dependence was stronger in the warm passive case.

In the FA-CORBA platform the failover time depended on the number of state changes that happened in the application object since the last witness register pruning. These state changes (updates) have to be applied at failover to the application object inside the new leader. The operations associated with applying an update on the new leader consist of reading the update to be applied while performing the consensus primitive (in the failure or no-failure case), followed by applying the read update to the application object. Let us consider these operations as similar to the operations associated with replaying a method in the FT-CORBA platform. This gives us the possibility to elegantly compare failover times obtained in experiments with the two passive styles in the FT-CORBA platform and failover times obtained in experiments with the FA-CORBA platform. Using absolute failover time values for the comparisons would not yield relevant information, as the number of requests to replay, respectively updates to apply, could differ from one platform to the other.

Figure 5.4 shows how the average time per method replay (FT-CORBA), respectively per update application (FA-CORBA), depends on the group size in a scenario where the server failed after approximately 600 requests and the checkpointing (respectively pruning) interval was 5s. The general tendency is that the average replay time is lowest in warm passive replication, followed by cold passive replication. The highest average time per individual replay (update application) was experienced in the FA-CORBA platform. The explanations are as follows:

the relation between the average replay times per method in cold and warm passive replication is due to the fact that the failover time in cold passive replication includes, besides the time for replaying the methods, the time to set the last checkpointed state on the new replica;

the average time for applying the stored state changes in the recovering leader (in FA-CORBA) is higher because each such operation implies an execution of the consensus algorithm, which, of course, is not needed in the FT-CORBA platform (this is summarized below).
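Since absolute failover times depend on how many entries had to be replayed, the comparison in Figure 5.4 is made per entry. With N denoting the number of method calls replayed (respectively updates applied), the plotted quantity is, in our own notation,

\[
\overline{t}_{\mathrm{per\ entry}} \;=\; \frac{t_{\mathrm{failover}}}{N},
\]

so that the difference between cold and warm passive replication is roughly the state-setting time spread over the N replayed calls, i.e. about $t_{\mathrm{state}}/N$, while the FA-CORBA value additionally includes one consensus execution per applied update.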

The length of the checkpointing/pruning interval makes a difference in the number of requests to replay (updates to apply); in our experiments, the numbers become larger as the intervals grow. Figure 5.5 shows the general tendency of the average time to execute one update application in the FA-CORBA platform. This average time grows with the size of the replica group (due to the consensus execution). Here we show the experiments with a pruning interval of 10s.

Figure 5.4: Average method replay and average update application times (in ms, 0-350 ms) for the warm passive, cold passive and fully available styles, with group sizes of three, four and eight replicas

Figure 5.5: Variation with group size of the average time to apply one update in the FA-CORBA platform (average time per update in ms, roughly 100-220 ms, for three, four and eight replicas)

5.3 Lessons learnt

This section presents lessons learnt from our two years' experience of building the two platforms and performing empirical studies by running the Activity Manager on top of both of them. The points presented here are not necessarily reflected in the result graphs presented earlier. They contain an overall characterization of the platforms and of the types of applications that are well suited to run on top of them.

The average roundtrip time for update method calls is highly influenced by the nature of the client side code. That is, the sequence and placing in time of the method calls affects the wait time caused by the serialized execution of methods on the server. The wait time for an update method call is the sum of the execution times of the update operations that arrived just before that method call and are executing or queuing on the server. In the FA-CORBA infrastructure the wait time also includes the time spent executing the consensus algorithm after each execution on the application.

Further, in the FT-CORBA infrastructure, by enforcing the serialized execution of (non-independent, as well as independent) state portion changes from the middleware level, a request cannot be executed on the application object until the previous one has completely finished execution there. Chapter 6 will present an alternative: imposing the serialized execution from the application level, where there is more knowledge about when a portion of the state is no longer accessed in a method, and thus can be changed within a subsequent method execution.

In both infrastructures, there is no support for handling application created threads, these being in the same address space as the object itself. In this context, it is possible to modify the object's state by executing operations directly on it, rather than via CORBA calls that could be intercepted and logged. Such state modifications are thus not reflected in the logged method calls/state changes, and therefore cannot be replayed after a crash. Using such a design pattern is consequently not advisable in sensitive applications.

In the FT-CORBA infrastructure, because of the separation between the failure detector and the client (both can in some way be seen as failure detectors), it is possible that the client discovers the application server crash before the "official" failure detector finds and reports it: e.g., when the monitoring interval is relatively large (30s) and meanwhile the server thread crashes. This is a problem for cold and warm passive replication. Thus, even though the client cannot reach the server (primary), possibly due to a real crash, no failover action is initiated. What happens is that the client sends the request to an alternative server (a backup). Still, the backups are not supposed to process any requests, but send a forward exception to the client. The latter will receive the still unchanged address of the server group. It will use it over and over again until it realizes that all alternative addresses have been tried out. Finally it will give up (throw an exception in the application). An immediate solution to this problem would be to tune the application such that the right fault-monitoring interval is chosen. Another possibility would be for the client to announce the failure to the replication manager, which would then modify the monitoring intervals for the failure detectors.

5.4 Reflections on the FT-CORBA infrastructure

Building and using the FT-CORBA infrastructure facilitated the evaluation of the concept of middleware support for FT in itself. On the one hand, it is attractive to use a service oriented middleware (CORBA) extension to build fault-tolerant applications. The only requirement on the application writer is to start the service objects, deploy them on a set of hosts, and set the replication policy. Building the infrastructure has shown us the feasibility of extending an existing CORBA implementation to support fault tolerance by closely following the FT-CORBA standard. On the other hand, building a fault-tolerant infrastructure following the CORBA standard implies using more resources and introducing new single points of failure. For example, the fault monitors, though not fully specified in the standard, are implemented as separate entities (CORBA objects), this being the most straightforward solution. Also, monitoring an object via a CORBA call is somewhat a waste of resources.

Furthermore, in the current context (following the FT-CORBA standard) there is no straightforward solution to the late failure detection problem mentioned above. The backups could be designed so that they can notify the failure detector of the primary that something seems to be wrong, because a request arrived at the backup. The request then has to be blocked until the failure detector can tell whether it was just a false alarm or a possible failure. In the latter case, the failure detector will announce the failure, and failover action will be taken. To implement unreliable failure detectors is not straightforward either. The reason is that the failover is handled by a separate unit (the replication manager) rather than by the server group itself, unlike in the FA-CORBA infrastructure. As soon as the failure detector senses something looking like a failure, it reports it to the fault notifier and further to the central failover handler. The failover procedure is triggered, and therefore a drastic change in the group (e.g. promotion of a backup to the role of primary) happens, without a straightforward way of turning back. Therefore, the failure suspicion of the failure detector is made definite, implicitly assuming a perfect failure detector. A solution to this problem is to "keep alive" the suspected server (of course, if it really crashed, then the suspicion is correct). Further, the client will get its request processed by the server it can reach (the old or the new primary). Meanwhile, if the failure detector "changes its mind" about the suspicion, it will report this to the fault notifier in the same way that it reports a failure, and a reverse process can be started. However, this solution can lead to the existence of transient periods (even if very short) when two primaries coexist.

Conclusions

The most critical point is perhaps that all service building blocks of the FT-CORBA infrastructure constitute single points of failure. Thus, we can obtain transparency and modularization, but we have to deal with infrastructure vulnerabilities. In earlier works there are indications about using the infrastructure itself to replicate the CORBA objects that are part of the infrastructure [111]. However, it is not clear how the approach deals with the vulnerability. Extra resources are needed, but single points of failure are still not removed. In the case of infrastructure components it might be easier to use more efficient replication strategies that do not provide transparency, a property that is important for application object servers, but not for an infrastructure.

The FA-CORBA infrastructure sorts out most of the mentioned problems. However, it requires a majority of the replicas of the group to always be up. Thus, if the number of replicas that the application can afford to use is fixed (n), then as soon as ⌈n/2⌉ replicas fail, the algorithm cannot function, and the server cannot answer any more requests. In the FT-CORBA infrastructure, in a similar setup, the server will still function even if n-1 replicas fail.

The results shown in Figure 5.2 plead in favour of using passive replication in the FT-CORBA infrastructure. The failover time comparison shown in Figure 5.4 confirms that the FT-CORBA passive replication styles are to be chosen for high availability of the application. Active replication, on the other hand, does not recommend itself, due to the extra resource requirements and the high overhead during no-failure periods. If we consider the FA-CORBA infrastructure, the resource demand is also quite high, as a majority of the group members have to be up in order to be able to process client requests at all.

Chapter 6

Improving performance of fault-tolerant software

This chapter presents our approach for using aspect-oriented programming to extend application code in order to accommodate a middleware level mechanism. More precisely, the call logging and serialization enforcement mechanism is moved from middleware level to be executed at application level. The chapter includes two results:

the implementation of the approach by extending the FT-CORBA infrastructure that we described in Chapter 4 and by devising appropriate aspects

the performance results in terms of roundtrip time overhead, measured using the same application as in Chapter 5 run on top of the modified infrastructure. We compare the overheads obtained when using the original FT-CORBA infrastructure and when using the AOP extended infrastructure.

We will begin the chapter by presenting the motivation for our work, followed by an explanation of basic terms used in this relatively new area of software engineering. The last section of the chapter is dedicated to presenting other research involving aspect-oriented programming as a means to offer flexibility to middleware.

6.1 Motivation

As mentioned in Section 5.3 in the context of the FT-CORBA infrastructure, enforcing sequential execution of methods at the middleware level impairs the average roundtrip time of requests. An alternative is to lift middleware level mechanisms so that they are performed partly at the application level. The advantage is that at the application level more information about the semantics of the methods is present and can be used. Although the application code has to be changed to implement the approach, this can be done with very little application writer intervention by using Aspect-Oriented Programming. We provide our implementation and measurements by extending the FT-CORBA infrastructure that we built. However, we claim that the methodology described can be used for other middlewares with fault tolerance handling capabilities such as logging, and for programming languages other than Java.

We use aspects to introduce the call information logging clauses from the middleware level into the method code itself (i.e. at the application level). As soon as a method is executed on the application, as a result of processing a client request, the logging is performed as well. Method executions must still inflict their changes on the non-independent state portions in the same order in which the corresponding call information records were logged. The difference now is that, if independent portions of state must be changed by two executing methods for which call information was logged in a certain order, those state changes can be performed in any order, after all non-independent state portions have been changed in the respective order given by the logging. In Java terminology, in this variant, synchronization of thread execution is not done at the level of whole method executions, but at the level of state variable accesses.

Let us take an example Java class ExampleClass containing methods method_a and method_b, as in Figure 6.1. Consider the scenario where method_a's call information record is logged before method_b's. When logging is performed at the platform (interceptor) level, the start of execution of method_b on the application object is delayed until the execution of method_a has ended. On the other hand, with application level logging and field level synchronization, the execution start of method_b is delayed only by the execution time of the first line in method_a (the assignment v1=v2/3). This is due to the fact that v1 is the only non-independent state portion involved in both method executions. After (but only after!) the ordered changes to v1, the other (independent) state portions (v3, v4, v5, v6 and v7) can be changed in any order due to the execution of method_a and method_b.

void method_a(Str[] val){
    StructureC loc;
    v1=v2/3;
    v2=17;
    v3=v4 || v3;
    loc=v5.field2;
    loc.set_a(3);
    v5.meth_(v6,4);
    for(int i=0;i ...
}

void method_b(){
    v1=35-v1;
    if (v7>25)
        v7=v7-6;
}

Figure 6.1: Two method calls that access non-independent parts of the state

To further understand why using aspects is a good way of modifying application code with no application developer intervention, we present the related concepts below.

6.2 Basic AOP concepts

The term AOP (Aspect-Oriented Programming) was first introduced by Kiczales et al. [80] in 1997. This new programming paradigm was needed to complement the well-known and praised object-oriented programming, since OOP fails to aid software engineering where non-functional features have to be represented in the design. Besides, non-functional properties of a system crosscut functional blocks, and this makes them difficult to represent as separate modules. Crosscutting examples, and how easily they can be implemented in aspect description languages, are given by Lopes et al. [98].

AOP involves defining so-called aspects to be weaved into the objects (components) they crosscut. In our work, aspect is used in the sense of code that is inserted at a well-defined point in the execution of the code for the functional part of a system, and thus changes the code. In order for the code of an object (or component) to incorporate the non-functional features described by aspects, a process called aspect weaving must take place. This involves the code of the component/object, an aspect weaver and the aspect code. The result is a component/object enhanced with non-functional features. Figure 6.2 illustrates the concept of aspect weaving.

In the following we briefly describe the AOP related notions we used in our work. These are defined conforming to their implementation in AspectJ, the aspect language we chose for defining our aspects. The concepts used in AspectJ are very similar to those used in other aspect languages, but they are written according to a different syntax.

Figure 6.2: Aspect weaving: the application code and the aspects are combined by the aspect weaver, producing application code with the aspects woven in

AspectJ is an aspect description language that is used in relation with the Java language [31]. It provides specific constructs to define pointcuts and advices. An aspect in AspectJ is a first class entity, meaning that at runtime an instance of an "aspect class" is created whenever that is needed, depending on the description of the aspect class. Advices are the parts of the aspect code that are actually executed when specified join points are reached. A join point is a well-defined point in the execution of a program. For example, join points defined by AspectJ are, among others, the start of execution of a certain method, a call to a certain method, and read/write (get/set) access to a field of a class. In AspectJ, the code of an aspect consists of pointcut and advice definitions. A pointcut is a program element that picks out join points, as well as data from the execution context of the join points. Pointcuts are used primarily by advices. The advice code can be executed before, after or around (instead of) the application code at the join point. These three keywords are used when defining the respective advices inside the aspect. The example in Figure 6.3 illustrates a "before" advice.

(AspectJ is a trademark of Xerox Corporation. When used in connection with objects or classes, "method" means an operation of a class.)

// Aspect definition
public aspect ExampleAspect {

    // Pointcut definition
    pointcut execmethod_aPointcut(Str[] val):
        execution(void method_a(Str[])) && args(val);

    // Advice definition (to be executed before the join point,
    // i.e. before method_a starts executing)
    before(Str[] val): execmethod_aPointcut(val) {
        System.out.println("before execution of method_a");
    }
}

Figure 6.3: Advices, join points, pointcuts

Thus, if the aspect code is weaved into the application code, the new application object will print out the given text whenever it starts executing method_a. Words such as aspect, pointcut, execution and before are AspectJ keywords. The code is written using the AspectJ syntax for defining aspects, pointcuts and advices. Besides enriching existing methods of a class (component) by adding pieces of code that execute at join points, it is possible to enrich the component itself by adding new methods and fields. This way of enriching is called introduction in AspectJ. For example, our aspect (ExampleAspect) can be extended to introduce a new method method_c in the class ExampleClass (see Figure 6.4). It is possible to define pointcuts (and thus advices) related to the new method as well.

6.3 Aspects for fault tolerance mechanisms

In our approach, the application writer will be provided with aspects incorporating clauses for call information logging, as well as variable level synchronization clauses. The aspect code will be automatically generated from a simple gray-box description of the server application object, provided by the application writer. By gray-box we mean that no application code has to be provided (as in a white-box description), but that it is not enough to only provide a list of methods, with their parameters, to be called on the object from outside (as in a black-box description).

public aspect ExampleAspect {

    // Introduction of a new method, method_c, in class ExampleClass
    void ExampleClass.method_c(int i, boolean b) {
        // Body of the new method
        v1 = i/7 + 20;
        v3 = b || v4;
        v7 = v1 - v2;
    }

    pointcut execmethod_aPointcut(Str[] val):
        execution(void method_a(Str[])) && args(val);

    before(Str[] val): execmethod_aPointcut(val) {
        System.out.println("before execution of method_a");
    }
}

Figure 6.4: “Introduction” in AspectJ

Aspects can easily be weaved into the application code, to provide better performance for the application on top of, e.g., an FT-CORBA middleware that uses a separate call information logging mechanism and a different synchronization mechanism. To illustrate this approach, we used our implemented FT-CORBA platform (Chapter 4) and performed some modifications to support aspect-oriented application extensions. The modifications to the middleware were mainly within the interceptors, which now skip the call information logging, leaving it to be completed at the application level. The application code is automatically modified using the corresponding aspects. The exact changes in the FT-CORBA infrastructure that concern the logging mechanism (in order to perform call information logging at the application level) are as follows:

1. Interceptors used in primary-backup replication were adapted to throw an exception containing the call information that has to be logged from the application level. Throwing an exception is needed because the call to a method (in the form in which it comes from the client) has to be stopped from reaching the application object. After the exception is thrown from the interceptor, other middleware levels are informed to take the necessary steps so that the application receives the call to the right (pseudo) method (the one that incorporates the logging clauses and whose signature has an extended set of parameters). These changes solely affect the server side interceptor in the FT-CORBA infrastructure. Thus, an application may choose to use the standard (FT-CORBA) mechanisms (i.e. those that do not throw the mentioned exception), or the version supporting aspects. This is simply done by using different interceptors when deploying the enhanced infrastructure. Applications with different performance requirements can thus exist side by side, using different interceptors.

Figure 6.5: Client-Server interactions in a FT-CORBA primary-backup setting: the client object calls method_a(...), and (1) the server interceptor calls log_call_info(...) on the logging object (the Logging & Recovery Mechanism), which waits for previous update methods to finish execution on the server and logs the call information; (2) after log_call_info returns, the method call (method_a) is allowed to reach the primary server object.

2. The original logging mechanism's interface was extended with a method for storing call information without waiting for previous calls to finish execution. Of course, if the application writer wants to use the regular FT-CORBA platform, he/she can choose to use the new logging object and call the (old) logging method that includes waiting, or to use the old implementation of the logging object.

In figures 6.5 and 6.6 we present the FT-CORBA infrastructure providing support for application writers to deploy fault-tolerant applications. Figure 6.5 illustrates the FT-CORBA platform described in Chapter 4, where all infrastructure operations, such as logging, are performed at the middleware level. Figure 6.6 illustrates the extension of our FT-CORBA infrastructure with the addition of support for aspects, where logging is performed from the application level.

Figure 6.6: Client-Server interactions in a primary-backup setting using an aspect and FT supporting middleware: the client object calls method_a(...), and (1) the server interceptor throws an exception with the call information, followed by a call of pseudo_method_a on the server; (2) the aspect-extended server object calls log_call_info_appl(...) on the logging object (the Logging & Recovery Mechanism), which logs the call information.

As described earlier, an application writer who uses the support from a regular FT-CORBA infrastructure has to add a couple of small extensions to the

application: methods for getting and setting the object state (get_state and set_state). Logging of call records in the right order is entirely taken care of by the infrastructure. Similar additions to the application code are also necessary when the application writer chooses to use our FT-CORBA infrastructure with support for aspect orientation. Besides, to aid the automatic generation of the aspect code, the application writer has to provide a gray-box description of the server application object. This description complements the (e.g. CORBA IDL) description of the methods offered by the application server object. Synchronization and call information logging aspects will be weaved only into the code of update methods. Synchronization at variable level involves the fields of the object that are accessed inside an update method. With respect to taking and releasing the lock corresponding to a variable (in the context of variable access level synchronization), the first and last accesses to that variable inside the method are important. To simplify the implementation, we make no distinction between read/get and write/set accesses to a variable. Also, the accesses are counted in terms of the number of appearances of the variable in an instruction (for example, in the assignment a=a/2+a, the number of accesses to a is three, although to compute the right hand side expression a has to be read only once). When working with a CORBA based platform, the following elements have to be present in the gray-box description of a server object:

A table (TypeConv) of Java type to IDL type mappings; the table is used when call information is logged; thus, the mappings have to be given only

for types of parameters or results of the update methods;

A list (Interface) with update method descriptions, in terms of signatures, with parameter types and names;

A list (State) with the types and names of object fields (state portions) that are accessed in update methods, as well as the types of field portions that are accessed via local variables in those methods;

A table (FieldAccess) of method names and corresponding accessed fields (with names), or types of local variables, followed by number of accesses during the execution of the corresponding method.

Note that all of this information can, in principle, be automatically deduced by static analysis of the program. However, for the purpose of this study, we assume it is given. Further, although we presented the elements of a gray-box description for a CORBA based middleware, we claim that similar descriptions can easily be produced for other middlewares to which aspect support is added. For our class ExampleClass the description is presented in Figure 6.7; the method code in which aspects have to be weaved is that of method_a and method_b. Method method_a, when executing, accesses the fields v1, v2, v3, v4, v5, v6, v7, and a local variable of type StructureC which, when modified, changes the state of the object (via field v5). The description is written according to a syntax that we impose: the section markers (#BTypeConv, #ETypeConv, etc.) in Figure 6.7 are keywords; also, the use of ":" in the FieldAccess section is our choice.

6.4 Implementation issues

The FT-CORBA infrastructure requires application writers to indicate which replication style they wish to use in their application (primary-backup with cold or warm passive replication, or active replication). In this section we explain how the support for the two primary-backup mechanisms is implemented within the aspect-orientation framework. Support for active replication could also be provided with little effort; however, since queuing time is not the largest part of the overhead in that case, no big performance improvement is expected. For warm/cold passive replication, the call information that is logged consists of the name of the method (e.g. method_a), the list of call parameters encapsulated in a list of middleware-specific types (in our example the argument val of type Str[] is encapsulated in the single element of an array of elements of the CORBA-specific type Parameter), and the unique request identification information (retention identifier, client identifier).
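As a rough illustration only, the logged call information described above can be pictured as the following hypothetical Java structure; the names (CallRecord, Parameter) are ours, and the platform's internal types differ.

// Hypothetical sketch of the logged call information for a passive-replication
// update call (names are ours; the platform's internal representation differs).
class Parameter {            // stand-in for the middleware-specific parameter wrapper
    String idlType;
    Object value;
}

class CallRecord {
    String methodName;       // e.g. "method_a"
    Parameter[] parameters;  // the call parameters, encapsulated in Parameter objects
    int retentionId;         // unique request identification:
    String clientId;         //   retention identifier and client identifier
}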

#BTypeConv
Str[]  StrList
#ETypeConv

#BInterface
void method_a(Str[] val);
void method_b();
#EInterface

#BState
int v1, int v2, boolean v3, boolean v4, StructureA v5, StructureB v6, int v7, StructureC
#EState

#BFieldAccess
method_a  v1:1, v2:2, v3:2, v4:1, v5:1, v6:1, StructureC:1
method_b  v1:2, v7:3
#EFieldAccess

Figure 6.7: Gray-box description of the server application object

When logging is done at the infrastructure level, the latter piece of information is easy to obtain, as it is sent in a service context carried by the request. At the application level, on the other hand, this information is not available unless the method is called with these extra parameters as well: at the application level, unless something specific is done about it, there is no way to access request related information such as service contexts. Our approach is to generate one separate aspect code file for each method described in the list Interface. Each aspect so generated combines the introduction of a new pseudo method in the application code with the definition of advices that contain code for call information logging, and the definition of advices for variable level synchronization. The advices will be executed as the specified join points are reached. The new (pseudo) method is introduced in the class in order to be able to extend the parameter list of the original method without the application writer being obliged to write a new method with the extended signature. The body of the pseudo method is simply a call to the original method with the original set of parameters. To use the extra parameters, around advices are used.

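A minimal sketch of the introduced pseudo method is given below. The aspect name Asp_METHOD_A_Introduction is hypothetical, and the actual generated aspects also contain the logging and synchronization advices discussed in the following subsections; the sketch only illustrates the inter-type declaration that extends the signature and delegates to the original method.

// Sketch only: introduce pseudo_method_a into ExampleClass with the
// extended signature (retention and client identifiers added).
public aspect Asp_METHOD_A_Introduction {
    public void ExampleClass.pseudo_method_a(Str[] val, int r_id, String cl_id) {
        // the body is simply a call to the original method with the
        // original set of parameters
        this.method_a(val);
    }
}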

6.4.1 Aspects defined percflow

An explanation of the AspectJ percflow notion is needed here. If an aspect Asp_METHOD_A is defined with the attribute percflow(Pointcut p), where the pointcut p is defined inside the body of the aspect (exec_method_a_p in Figure 6.8), then one object of type Asp_METHOD_A is created each time the flow of control reaches a join point from the set defined by the pointcut exec_method_a_p. The difference compared with a non-percflow aspect Asp_METHOD_A is as follows: the variables defined inside the body of the aspect (is_first and counter in Figure 6.8) do not become instance variables of the aspect-woven class (here, ExampleClass), initialized at aspect instance creation; instead, they behave as variables local to the control flow selected by exec_method_a_p, created each time an instance of Asp_METHOD_A is created, i.e. each time method_a executes. Also, the code of the advices (before and after in Figure 6.8) described in the body of the aspect is executed at join points selected by pointcuts defined inside Asp_METHOD_A (get_field in Figure 6.8), but only at those reached while in the scope of exec_method_a_p.

6.4.2 Method execution related advices

To illustrate how an aspect file looks, let us take the example of method_a from our Java class ExampleClass. There will be one pseudo method introduced in the class, namely pseudo_method_a. The parameters of pseudo_method_a are those of method_a, plus the extra parameters needed, as mentioned before, in the case of primary-backup replication. The join point for which the method related advice is defined is the execution of the pseudo method. The aspect code is defined to activate its advices at the given join points whenever the method pseudo_method_a is executed, i.e. whenever the control flow reaches a join point defined by the pointcut exec_pseudo_method_a_p. To achieve this we need an aspect defined with the attribute percflow, which is declared as follows:

public aspect Asp_METHOD_A percflow(exec_pseudo_method_a_p(Str[], int, String)) {
    ...
}

The reason for defining all aspects to be of type percflow is the need to introduce some extra local counter variables in the methods, used as explained below. The advice code that will be executed at the method execution join point is of type around. This means that the code written inside the advice is run instead of the code of the method.

public aspect AspectForMethod_A percflow(exec_method_a_p(Str[])) {

    int counter;
    boolean is_first;

    public AspectForMethod_A() {
        counter = 0;
        is_first = true;
    }

    pointcut exec_method_a_p(Str[] val):
        execution(void method_a(Str[])) && args(val);

    pointcut get_field(): get(int v1);

    before(): get_field() {
        if (is_first) {
            ...
        }
    }

    after(): get_field() {
        counter++;
    }
}

Figure 6.8: Example of an aspect defined percflow

The new execution starts with logging the method call information obtained from the parameters of the call. Further, the numbers of threads currently accessing the fields that will be read/written by this method (thread) are assigned to the extra local counter variables, and the global variables containing these numbers are incremented. These global variables are static fields in an aspect class (not detailed here) specially defined to contain them. Finally, the (pseudo) method itself is called (by using the AspectJ keyword proceed). A fragment of the aspect code used for the extension of method_a is shown in Figure 6.9. The reason for the somewhat strange combination of introducing a new method in the application class and then replacing it with an around advice is that the local variables defined inside the percflow aspect cannot be accessed in the pseudo method itself, being a method merely introduced in the application class.

// pointcut associated with the execution of pseudo_method_a
pointcut exec_pseudo_method_a_p(Str[] val, int r_id, String cl_id):
    execution(void pseudo_method_a(Str[], int, String)) && args(val, r_id, cl_id);

// "around" advice replacing the call to pseudo_method_a
void around(Str[] val, int r_id, String cl_id):
        exec_pseudo_method_a_p(val, r_id, cl_id) {
    ...
    logging_obj.log_method_call(val, r_id, cl_id);
    ...
    proceed(val, r_id, cl_id);
}

Figure 6.9: An "around" advice

Of course, the server side ORB has to know to call the pseudo method instead of the original one, since the client will transparently call the latter. The code of the skeleton generated at the server side is augmented with calls to pseudo methods, according to the component description provided by the application writer. At runtime, the skeleton object is directly provided with the name of the pseudo method to be called: at the point immediately above the server interceptor, the method name is simply changed from the original (as called by the client) to the pseudo method's name. The server interceptor is responsible for throwing the exception that informs the other ORB levels about the name change and, of course, about the extra parameters (the IDL compiler was changed to cope with the addition of pseudo method calls to the skeleton's code). Note that, e.g., method_a still exists in the class ExampleClass. However, it will not be called when the aspect-supporting infrastructure is used. Instead, pseudo_method_a is called, with the functionality of method_a preserved inside it.

6.4.3 Field level synchronization advices

Other join points used to define advices are the get/set accesses to object fields and local variables that can modify the state. To select get/set accesses to e.g. a field called v1 of type int, the following pointcut is written in AspectJ:


pointcut get_set_v1_p(): get(int v1) || set(int v1);

The variable level synchronization takes place as follows. Before the first access to a field in a method, the current thread acquires the semaphore corresponding to that variable, after all previous accesses by other methods (threads) have finished. It is known when this happens because the local counter variables are decreased whenever the semaphore corresponding to that field is signalled. The semaphore is signalled in a method after the last access to the field in that method. The first access to a variable in a method is detected in the advice code that is always executed before the variable access. To detect the final access, all get/set accesses are counted and the number is compared with the value specified in the component description file. The semaphore variables are also static fields in the same aspect class that contains the global counter variables.
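A minimal sketch, under simplifying assumptions, of how such field level synchronization advices could look for field v1 inside method_b is given below. The names SharedSync and Asp_METHOD_B_v1_sync are hypothetical, and the generated aspects use the counter-decrement scheme described above rather than this simplified acquire/release bookkeeping.

import java.util.concurrent.Semaphore;

// Hypothetical holder for the static semaphore and access count; in the real
// platform these are static fields of a dedicated aspect class (Section 6.4.2).
class SharedSync {
    static final Semaphore v1_semaphore = new Semaphore(1);
    static final int V1_ACCESSES_IN_METHOD_B = 2; // from method_b's FieldAccess entry
}

// Sketch of the synchronization advices for accesses to v1 within method_b.
public aspect Asp_METHOD_B_v1_sync percflow(execution(void method_b())) {

    int accesses = 0;        // accesses to v1 seen so far in this execution
    boolean is_first = true; // true until the first access has been handled

    pointcut get_set_v1_p(): get(int v1) || set(int v1);

    before(): get_set_v1_p() {
        if (is_first) {
            // block until earlier logged methods have released v1
            SharedSync.v1_semaphore.acquireUninterruptibly();
            is_first = false;
        }
    }

    after(): get_set_v1_p() {
        accesses++;
        if (accesses == SharedSync.V1_ACCESSES_IN_METHOD_B) {
            // last access to v1 in method_b: signal the semaphore
            SharedSync.v1_semaphore.release();
        }
    }
}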

6.5 Evaluation of the approach

We performed experiments in a replicated setting using the aspect-supporting FT-CORBA platform built as described in the sections above. To obtain relevant results, the same experiments with the O&M service (Activity Manager) as in Chapter 5 were conducted. The update operations performed on the activity manager server object by client requests modify its state mainly by creating jobs and activities, starting and terminating those, and reporting status changes in activity executions. The client that calls methods on the server was also the same as in the previous experiments. Thus, the client called six methods of the server, out of the fifteen available, as part of the experiments. The O&M service was augmented by the aspects generated based on the gray-box description.

Our goal was to compare the average roundtrip time overheads obtained in this setting with the average overheads measured using the baseline FT-CORBA platform. Overhead values were computed by subtracting the roundtrip time obtained in a non-replicated scenario from the time taken in the replicated one. Six update methods (called m_1, m_2, etc. in our result tables, and also referred to as method 1, 2, etc.) were used in the tests, and average overhead percentages were computed for each. We performed our experiments in the same hardware setup as the experiments with the FT-CORBA platform: a set of SUN Ultra SPARC workstations running SunOS 5.8, connected in a LAN. Also, as previously, we did not have control over the link traffic or the load on the machines.

The parameters we varied during the experiments were: the replication style (warm or cold), the number of replicas, the checkpointing interval, and the number of times the client called the methods on the server, i.e. the number of iterations of the loop which the client used. This was the case for both infrastructures: FT-CORBA (from Chapter 4) and the aspect supporting FT-CORBA (this chapter). The measurements obtained for each combination of the varied parameters will be referred to as one experiment. Thus, in the case of cold passive replication, for each server method called by the client, and for each infrastructure, we obtained 12 numbers designating average roundtrip time overheads, corresponding to 12 combinations of parameters:

£ number of replicas: 2 (1 value)

£ checkpointing intervals: 1, 5, and 10 seconds respectively (3 values)

£ number of iterations: 100, 200, 400 and 800 respectively (4 values) In case of warm passive replication, for each method called by the client, and for each infrastructure we obtained 36 numbers 6 designating average roundtrip time overheads, corresponding to 36 combinations of parameters:

£ number of replicas: 2, 3 , and 8 respectively (3 values)

£ checkpointing intervals: 1, 5, and 10 seconds respectively (3 values)

£ number of iterations: 100, 200, 400 and 800 respectively (4 values) Choosing only one value for the number of replicas parameter, in cold passive replication, was due to the fact that the parameter is expected not to influence at all the overheads. For warm passive replication, the overhead does not change much either with the number of replicas, but more than for cold passive. The two charts (Figure 6.10 and Figure 6.11) present the average roundtrip overhead computed over the above averages obtained in each experiment. That is, for each of methods 1 to 6, and each platform, a comparison is presented between:

£ cold passive: the average of the 12 numbers obtained, and

£ warm passive: the average of the 36 numbers obtained The charts show a general drop in the presented average values. This is especially true for cold passive replication where the drop is sometimes over 50%. Due to the fact that the call to m 1 does not interfere with so many other method calls, it should not be a surprise that for method 1 the results do not show an improvement. In fact, the synchronization clauses in the application code can themselves be caus- ing an overhead. In addition, there is a further explanation for the phenomenon: 100 Chapter 6. Improving performance of fault-tolerant software

aspect supporting FT-CORBA original FT-CORBA

450 400 350 300 250 200 150 100 50 0 m_1 m_2 m_3 m_4 m_5 m_6

Figure 6.10: Average over average overhead percentages for cold passive replica- tion when a method call arrives in the server’s wait queue, it is possible that a call to get state is there somewhere in the front. Now, get state cannot start ex- ecuting until all previous methods in the wait queue finish their execution; this is valid in both platforms (the FT-CORBA and the aspect-supporting FT-CORBA). Therefore, the method call that arrives after the call to get state (e.g. m 1), but before get state finished execution, has to wait until all methods arrived before get state (those for which get state is itself waiting) finish execution. In such a situation our variable level synchronization does not improve performance.

6.6 Discussion

In this chapter we presented the results of the successful merging of two areas: mid- dleware supporting fault tolerance with aspect oriented programming and weaving in the application. Using aspects is successful, since with almost no application writer intervention, application code was modified such that performance of server

request processing was improved. In some cases, as the result charts show, the over- head percentage dropped with more than £ ¡ as compared with a straightforward FT-CORBA implementation. As expected, there are also some drawbacks of the approach. By using a smaller granularity on method synchronization, average performance of method executions in a multithreaded environment is improved. However, this is valid when field 6.6 Discussion 101

aspect supporting FT-CORBA original FT-CORBA

500 450 400 350 300 250 200 150 100 50 0 m_1 m_2 m_3 m_4 m_5 m_6

Figure 6.11: Average over average overhead percentages for warm passive replica- tion

void method_a(Str[] val){ void method_b(){ StructureC loc; v1=35-v1; v1=v2/3; if v7>25 v2=17; v7=v7-6; v3=v4 or v3; } loc=v5.field2; loc.set_a(3); v5.meth_(v6,4); for(int i=0;i

Figure 6.12: Changed method a

accesses inside methods are organized such that the extra synchronization code execution time can be neglected. Also, there are cases where even by synchroniz- ing at variable level a method cannot start executing before a previous one com- pletely finished. Imagine the case where method a in our example (the class ExampleClass) is changed so that the line v1=v2/3 is moved from the begin- ning to the end of the method body (Figure 6.12). In a scenario in which the call to method a is logged first, method b cannot start execution, until method a releases the lock on v1, i.e. finishes its execution. An important issue is how to find out where the code for releasing the lock corresponding to a state portion has to be inserted, when this state portion can be 102 Chapter 6. Improving performance of fault-tolerant software

void method_a(Str[] val){ StructureC loc; ... loc=v5.field2; loc.set_a(3); ... }

Figure 6.13: Field modification impossible to capture by pointcuts

accessed inside an if statement or a loop. The last access of a variable can eas- ily be detected when the method code does not contain if or loop statements where the variable is possibly accessed an unknown number of times. Thus, the information provided by the application writer in the gray-box description already mentioned can only be used for these simpler cases. Further, even if the application writer could specify that a variable is possibly accessed in an if or loop state- ment, AspectJ or other aspect languages are not equipped with the ability to detect when an if or loop statement is entered and exited. In our application we do not have variable accesses inside if and loop statements, and therefore we did not face this problem when we built our experiments. As mentioned earlier, an update method can modify a local variable and this can lead to changing the state of the object (see in Figure 6.13 the fragment of method a where v5 is modified via loc). However, it is not possible to syn- chronize the access to the local variable with the access to another local variable in another method, referring to the same portion of the state. Consider the fol- lowing scenario: method method a accesses v5.field via loc; we can imagine method c in the class ExampleClass accessing the same element also for writ- ing, via a local variable loc defined in method c. When loc in method a is accessed, there is no way to know that loc in method c is referring to the same portion of the state, or the other way around. In our approach, the problem is solved by using a semaphore associated to the type StructureC, besides those associated to class fields (this is the reason why mention StructureC in the gray-box description in the State and FieldAccess lists). Thus, a semaphore s StructureC is associated with type StructureC, and each time when a variable of type StructureC is accessed first time, the lock is taken exactly like for object field accesses. Therefore, sometimes, unnecessary waiting will occur, if the two variables of type StructureC do not access the same memory location. If we think about code maintainability (in the software engineering sense), as 6.7 AOP and middleware 103 soon as the gray-box description changes, new aspects have to be generated and weaved in the code. It is possible that the server class was only extended with new methods, and maybe state variables. In this case, the new aspects can be weaved directly in the already “aspected” code. The worst-case scenario is when the non- aspected code has to be modified by weaving in totally new aspects. Of course, the improvement is most significant when considering the worst- case scenario, i.e. no commutativity assumption can be made about he execution of update methods. This is a reasonable assumption , especially when the middle- ware builder and the application writer do not have the chance to communicate. Moreover, in general, the middleware has to be built general enough to be able to accommodate many different applications. To our knowledge this is the first work that uses aspects for improving perfor- mance of a fault-tolerant application on top of a standard middleware. To reinforce this, the next section will present a survey of existing works using AOP to provide flexibility in middlewares.

6.7 AOP and middleware

Aspect-oriented programming was used from early years to incorporate adaptivity and quality of service features in middleware. Loyall et al. [99] (early years) and Duzan et al. [44] (recently) show how the concept of CORBA IDL 5 was extended with a set of Quality Description Languages (QDL) to represent quality of service requirements from clients. This is done in the QuO (Quality Objects) framework where the QoS (Quality of Service) specification, monitoring and adaptation are described as aspects. The client can use CDL (Contract Description Language) to describe quality of service contracts. ASL (Aspect Specification Language) is used to describe code for monitoring and adaptation of QoS. To be able to define aspects, researchers strove for developing languages for de- scribing aspects. Aspect description languages are devised as separate from high- level programming languages (for example Java, C++, Csharp - C#). However, a good feature is that they keep many constructions from those languages. As- pectC++ was developed based on C++ in the works of Gal et al. [59, 145, 61, 60, 58]. C# in the .NET framework is somewhat different. In the .NET frame- work, aspects, including fault tolerance can be defined as C# attributes [141]. In the CAMEO project, they are defined in similar way as in AspectJ combined with

5Interface Description Language 104 Chapter 6. Improving performance of fault-tolerant software support from the .NET framework and C# attributes [127]. Aspects can be used to offer flexibility in the code that is executed at runtime by a system. Hunleth et al. [75] show how aspects and AOP are used to build mid- dleware code that includes certain features chosen based on user requirements and added to a base configuration. The point is that these features are not necessarily known when the base implementation of the middleware is made. Dynamic AOP is present in the works of Popovici et al. [124, 123]. The idea is to use runtime hooks in the Java to weave and unweave aspects in the running code. Although the framework relies on the usage of the Java language, it offers interesting directions of using aspects other than enriching code at compile time with no possibility of “getting rid” of woven code. The danger with using dynamic AOP is security violation [117]. Ways of preventing identity forging or unwanted behaviour changes are code signing and contracts, as seen by Palmer. A sort of dynamic aspect weaving and usage, mainly at runtime, is presented by Zinky and Shapiro [171]. Their approach is to define adaptive features as classes and instan- tiate different pieces of code at runtime depending on the adaptation needed. The concept is called aspect oriented interceptor pattern, and is not related to a certain programming language. Using the concept of “fragmented objects” Hauck et al. [72] develop a middleware AspectIX to permit configuration of aspects for interac- tion of a client with such an object. Fragmented object is a CORBA object that is transparently distributed over several sites and a client using it has at least one of the fragments in its address space. Aspect orientation can be useful to handle the ever-increasing complexity of middlewares. Zhang and Jacobsen [166] propose a way of improving middleware architecture by horizontal decomposition to separate aspect modules. For this an aspect mining procedure is needed. CORBA is used as a case study. Coyler et al. [34, 33] presents aspect orientation as a way to perform the needed steps for designing manageable middleware systems: devising application-middleware in- dependence, middleware simplicity and componentization by clearly separating crosscutting concerns from functional modules. As case study, the middleware product-line of IBM involving thousands of classes was used. In our work we use aspect-oriented programming as a technique to lift mid- dleware level fault tolerance mechanisms to application level. As a result, perfor- mance of the application running on top of the FT middleware can be improved, with keeping transparency and no-intervention from the application writer. Works available in the area of fault tolerance as an aspect are mostly proposing ways of incorporating fault tolerance as an aspect in the design. Herrero et al. 6.7 AOP and middleware 105

[73] develop an aspect language called JReplica which is destined to describe at a very high level e.g. replication policies, actions to be performed before and after a replicated server is created, as well as error actions. The infrastructure where a JReplica description is used is residing on a communication middleware (CORBA, Java RMI) and has an interception layer. This layer acts on the request and performs actions described in the JReplica file. As an extension, the introduction of an extra stereotype in UML is proposed. The new stereotype called “Replication” would be helpful to express already in the design, the usage of fault tolerance features. France and Georg [56] introduce the concept of aspect-oriented design (AOD). In this approach two types of models are developed during the design: a model of functionality and a number of aspect models that include each modules related to a certain concern, e.g. fault tolerance. The advantages of separation are clear when thinking about handling changes in the fault tolerance aspect. After a change the new model is weaved with the functionality model. In this approach aspects are defined application-independent in the sense that no exposure of the application is needed to define pointcuts and advices. The role model is used to define properties of aspects. Weaving the aspect model into the functional model actually means that the functional model is transformed to fulfill the role whose properties are carried in the aspect. Fabry [48] is concerned with describing replication as an aspect. In an object- oriented context it is important what part of the object is replicated: its data (state) or behaviour. There are two other concerns that have to be considered and their separation as aspects is not always possible: naming of replicas and assignment of initial values to replicas. Both these operations are best done by the application that knows what initial values have to be assigned to the objects (i.e. provides and calls the constructor), and knows best for what are the objects used and thus what naming convention is suitable. However, in this situation, the replication aspect is not completely separable from the application. Another non-functional property of a system is real-time performance. Tesanovic et al. [151] show how aspects are used to develop component-based real-time sys- tems. The concept they present comprises a design method that assumes decompo- sition of real-time systems into components and aspects. Also, in a further contri- bution [150] the authors present a formalization framework for verifying properties of components, aspects and components changed with aspects. This work pro- vides the grounds for enhancing middleware to support real-time properties in a predictable manner. 106 Chapter 6. Improving performance of fault-tolerant software 6.8 Summary

This chapter presented an elegant solution to the problem of moving some mid- dleware level operation in the application code. A non-elegant solution would have been to leave only one option to the application writer: to make the appli- cation fault-tolerant with lower average overhead, the logging and synchronization mechanism should be coded into the application. Thus, our approach increases the potentials for middleware supported fault tolerance, at an improved performance level. One could ask: why introducing the usage of AOP only in connection with the FT-CORBA platform? The answer is that in the FA-CORBA platform, there would be no performance gain to move middleware level operations to application level. This is due to the fact that actually the application processing part is itself embedded in the middleware, by ways of using the portable interceptor in which the algorithm operations are implemented. Chapter 7

Computing the optimal checkpointing interval

This chapter will use mathematical modelling to obtain an optimal checkpointing interval for a passive replication mechanism. We focus on two optimization criteria:

The checkpointing interval that leads to highest average availability of a server system

The checkpointing interval that provides a minimal average response time for queries to the server

Checkpointing, although indispensable in failure prone systems to ensure avail- ability, involves a trade-off. If checkpointing is performed too often, then the server is often stopped from processing requests. On the other hand, failover times can in- crease if the log where operations (for replay) are stored is not pruned often enough. This chapter presents two mathematical models: first, a relatively simple model to maximize the average availability; next, an extended version of this model that includes the extra time spent to process the requests that queue up during failover, the so-called backlog processing time. This extra time has no significant influence on the maximal average availability, but it is crucial in the study of the average response time. In both models queuing theory is employed to model the client re- quests waiting to be processed by the server. The first model is used for computing the checkpointing interval that maximizes availability and the second for comput- ing the checkpointing interval that minimizes response time. This chapter has three main parts: 108 Chapter 7. Computing the optimal checkpointing interval

1. background information about the checkpointing procedure as it is performed in a setting resembling the FT-CORBA infrastructure

2. basic assumptions and a preliminary model (not including backlog process- ing)

3. the formulas used to derive the optimal checkpointing interval both for max- imizing average availability and for minimizing response time, respectively.

The expression of the checkpointing interval will be determined as a function of other parameters of the system such as average service rates, average request arrival rate, and average failure rate. The idea is to support the engineer with a tool that gets as input the above parameters and provides the best checkpointing interval. Depending on the optimization criterion, the proper formulas are used. To our knowledge this is the first time a mathematical model of a logging and failover mechanism is done at this level of detail and for the purpose of optimally defining the checkpointing interval. Moreover, it is the first time the study of back- log processing is introduced. In the coming section we will briefly survey the existing works on checkpoint- ing interval optimization. At the same time, we will position our work relative to previous results, thereby reinforcing the above claim.

7.1 Related work

Several researchers have studied the checkpointing trade-off in the context of fault- tolerant processing systems, and formulated the problem of the “optimum check- pointing interval”. Most of the results are obtained for settings where a long- running job (performing heavy calculations), or a process, is from time-to-time checkpointed [164, 87, 88, 29, 121, 95]. Other models include request process- ing or message logging systems [63, 148, 146] and request processing in mobile environments [27]. A recent survey by Elnozahy et al. [47] presents definitions, different aspects of checkpointing and logging, as well as implementation issues of transparent rollback-recovery techniques. However, the survey does not cover the aspect of checkpoint interval optimization. Early results by Young [164] were based on simple assumptions such as fixed time between checkpoints, as well as the time taken for recording the state. This seminal work provided a simple, yet applicable formula for the value of the optimal checkpoint interval (depending only on failure rate and state recording time). 7.1 Related work 109

Criteria for checkpoint interval optimization are discussed by Krishna et al. [87]. Two types of systems are considered: real-time systems with hard deadlines and general-purpose systems. The criticism is directed towards two optimization criteria that are not powerful enough in relation to an application: availability mea- sure and mean response time. Good criteria, on the other hand, envision aspects of the systems that really show how checkpointing is advantageous when failures occur. For real-time systems a recommended criterion is a measure of how well checkpointing helps tasks to not miss their deadlines under failures. For general- purpose systems a proposed criterion is a ratio between how much processes that fail gain by way of checkpointing and how much the other processes gain. In our modelling and analysis, the gain is shown when studying the average response time when no checkpointing is used in comparison with the minimum response time (see Section 8.1.1). Coffman et al. [29] study the problem of scheduling checkpoints also in the context of fault-tolerant very long lasting computations. For different uniform fail- ure arrival time laws, and considering the checkpoint time as constant, the aim is to maximize the time spent in useful computations during a certain time interval where checkpoints are also scheduled. Kulkarni et al.[88] analyze the optimal checkpoint interval in a queuing system where long-running jobs queue at a server for execution. The analysis focuses on the completion time of one long-running job in presence of checkpointing and failures. In our context, incoming “jobs” are not considered as long-running, and besides, checkpointing can only delay queuing requests, and not executing ones. Plank and Thomason [121], as well as Vaidya[153] consider the two time val- ues called checkpoint latency (total checkpointing time), and checkpoint overhead (time taken by checkpointing from the running process) to be different from each other, but constants. Previous works consider these as equal. In our analysis both of them are random variables. Works that model request processing server systems by queuing, usually do not analyze the call replay time part of the failover time in terms of the num- ber of calls/messages to be replayed. The assumption of the state transfer time

being a constant is kept. For example, Gelenbe [63] defines the failover time

¡

  

 ¢

¡ , where is the checkpointing interval that has to be optimized. ¢ depends on a constant that is the proportion of requests that has to be reex- ecuted after a failure. ¡ is the state transfer time (constant). In that paper, it is proved that when  is deterministic, then the system’s availability is maximal. In

our model, instead of using a constant like ¢ , we assume an exponential distribution 110 Chapter 7. Computing the optimal checkpointing interval for the replay time for one call, and find a distribution of the number of calls to be replayed, to further compute the failover time. Tantawi and Ruschitzka [148] use the same type of formula for the failover time, where the two coefficients are constants. They assume general failure in- terarrival time distribution, as well as the possibility of failures occurring during failover. However, due to the generic assumptions they end up providing mathe- matically intractable computations, even if failures are not allowed to occur during failover and checkpointing. Only when considering an equidistant checkpointing strategy the problem of computing an optimum becomes tractable. More practical works in similar contexts are conducted by Chen and Lyu [27], and Ssu et al. [146]. Chen and Lyu present a mathematical model for a check- pointing system in mobile environments. Their model is extended as compared with previous ones, because of hand-offs that can occur. They define effectiveness as a measure for how well the system performs under failure conditions, with and without checkpointing. They assume that failures can occur during recovery (re- pair) after a failure. The repair time is modelled as a random variable with general distribution. No explicit modelling of message logging and replay is used. Ssu et al. develop a checkpointing scheme in which the interval between checkpoints is modified during execution depending on incoming request rate as well as failure rate. Message logging is involved, to reduce failover time. A user provided upper bound on failover time is considered when deciding whether to make a checkpoint or not. A different approach, based only on simulation, is taken by Katsaros and Lazos [79] in their work about determining the optimal state transfer policy in a repli- cated scenario. The study is conducted on an implementation of the FT-CORBA standard. Since in active replication state transfer to a new replica is needed only when the number of alive replicas drops under a certain value, this replication is recommended when the time to transfer the state is longer than the time to execute a request. In cold and warm passive replication state transfer and thus checkpoint- ing are done often, and therefore it is worth to study the type of checkpointing to be employed (periodic or load based) and the interval at which it should take place. The results of the simulations show that load based protocols give rise to lower response time overheads, and lower sensitivity to size of the state, as well as increased system availability. The area of optimizing response time by choosing an appropriate checkpointing interval has not been studied so extensively. The work of Gelenbe and Derochette [64] comprises mathematical analysis for determining the checkpointing interval 7.1 Related work 111 minimizing the response time of transactions, respectively maximizing availability of database systems. As a difference from our work, the time taken by a checkpoint operation is considered exponentially distributed. Also, as in the work of Gelenbe

from 1979 [63], the failover time is considered simply as proportional (via a con- ¢ stant ¢ ) to the checkpointing interval, where is not computed in the mathematical model, but chosen empirically. An outcome of the work (as will be visible in our results, too) is that the optimal checkpointing intervals (rates) are rather different for the two optimization criteria. Huang and Jalote [74] present the effect of fault tolerance (in terms of replication and degree, total failures and repair time) and checkpointing on the average response time. A model is presented for computing the average response time in a primary-backup setting, where the resulting formula contains the system availability. No explicit logging of requests is considered. Fur- thermore, the replay is done after all requests since the last checkpoint are resent to the new primary. The replay time is simply computed as being proportional with the checkpointing interval.

In the area of real-time, the influence of checkpointing on performance has been studied mainly for hard real-time systems ([168], [129]). Both works present a for- mula for the worst case response time considering different failure arrival patterns and checkpointing strategies. Zhang and Chakrabarty [168] consider the rollback and state restoration time to be zero. Punnekkat et al. [129] present an optimal algorithm for checkpointing real-time tasks such that the least time is needed if a failure occurs and some task has to be reexecuted.

Even if not directly related to checkpointing optimization, queuing theory tai- lored for real-time systems - so called Real-Time Queuing Theory (RTQT) [93] - provides solutions for the queuing system in heavy traffic. Lehoczky [93] reduces a queue with heavy traffic to a Brownian motion. RTQT is used in the work of Zhu et al.[170] to analyze design trade-offs in a real-time system where tasks with given end-to-end deadlines are served in a network of server nodes forming a pipeline.

Our two criteria for optimization when determining the optimal checkpointing interval are similar to other researchers’ choice: the average time the server is avail- able for processing client requests, chosen by Plank et al. [121] and Vaidya [153], and the average response time of a request, like in the work of Gelenbe [63]. Al- though these criteria may not be recommended for hard real-time applications [87], we consider them powerful enough for meaningful analysis in a wide range of high availability and soft real-time applications. 112 Chapter 7. Computing the optimal checkpointing interval 7.2 The checkpointing procedure

Chapters 7 and 8 consider the mathematical modelling of a part of our FT-CORBA infrastructure (presented in Chapter 4), namely the logging and checkpointing unit used in failover procedure (see Section 2.2.3), plus part of the application (the server object). However, we claim that the modelling and analysis performed can be applied to any other middleware, or even in contexts where middleware is not present. The goal of the optimization chapters is not to formalize the original over- head study. Rather, we aim to support the tuning of one parameter (the check- pointing interval) in order to achieve maximum availability, respectively minimum response time. Another point to note is that we analyze the primary-backup service availability as if there were sufficient number of backups available for the given failure frequency. A special case of this is the assumption of one backup and that there will be no failures until the most recently failed primary has recovered. If the number of backups is limited or the latter assumption is not valid in the circum- stances then the model has to be extended to include the repair time for a server. We now go on to describe the same checkpointing procedure that was already addressed in the Chapter 4. However, here more (and different) details are given on the operations involved therein, as these are needed to better understand the modelling that will follow. We envisage that similar operations (constituting the checkpointing procedure) are part of any such middleware or application that uses the primary-backup mechanism, and thus have a general character. Figure 7.1 shows the general scheme of checkpointing and logging procedures. Upon arrival from the client, the update method calls are logged by a server that is dedicated for logging (Logging server on the left hand side). The main server that handles the application object is denoted by Application server. The call log includes information about the result of executing the client query on the Application server, too. This information is used by the middleware in order to avoid executing the same call more than once on the server (and is used only if a client sends the same query more than once). Thus, for an arriving request, before logging call information, the system looks for the record of a reply to the call. If this is found, the reply is returned at once to the calling client. Thus, no operation on the application object will take place. Otherwise, the call information is recorded in the log, and the request will proceed to the Application server for being processed on the object and accom- plishing the actual serving of the client’s request. After the request is processed on the Application server, the request “returns” to the Logging server to 7.2 The checkpointing procedure 113

reply to client /infrastructure Arrival from outside request from client Logging Application Arrival from inside checkpointing request server server from infrastructure reply logging

Figure 7.1: The logging and application servers and their relation

write its reply in the log. This step is followed by sending the reply to the client. From time to time the infrastructure initiates a checkpointing request (by initiat- ing a get state call on the Application server). This call is treated in a similar manner to the calls generated by clients. To summarize, there are two categories of call information logging: (1) calls corresponding to the client requests and (2) calls corresponding to the get state request. There are two categories of reply logging: (1) the returned result by the client initiated calls, and (2) the result of the get state call, i.e. the completed checkpointing of the state. When storing the call information, the log is checked for replies, thus a reply logging operation has to wait in a queue. When a reply logging operation is performed, no call information logging goes on, and so on. Hence at the Logging server the calls are executed sequentially in a run to completion manner (i.e. no interruptions of one execution by another). We use a combination of forked [153, 121] and sequential checkpointing: from time to time, the checkpoint procedure is initiated by a parallel process. However, this process only inserts a checkpointing request in a queue where normal requests are queuing to obtain service. As mentioned before, while the checkpointing re- quest is serviced, no other request is processed in the server. The checkpointing procedure has six phases:

1. waiting to mark the last (client) call record

2. marking the last call record

3. waiting to execute the “state reading” on the Application server

4. reading the state from the server

5. waiting to record the state 114 Chapter 7. Computing the optimal checkpointing interval

1 t j+ 1 t j t j n t j+ in in i in o o o s o p s p p n p s k k s k i k c in c d c g c d e g e n e e e n h e h e h b h e c b c c c WL SL WA SA WL SL ………………... time P

Figure 7.2: The checkpointing procedure times slices

6. recording the state in the Logging server

In what follows we explain the above in more detail. Upon arrival of a get state request, the last call record from the log has to be marked. This is needed because the calls that arrived before the get state request shall all be executed at the Application server before get state, and therefore their call records can be removed from the call log after state record- ing. The marking operation is in fact achieved by logging the get state call that was mentioned in the last section. Thus, the call to get state first spends some

time at the Logging server (phase 1 above, denoted by the first ¡ in Figure

7.2), and then its own logging amounts to marking the previous call (phase 2 above, ¡ denoted by the first ¢ in Figure 7.2). Next, the call to get state arrives at the Application server, and waits for all queuing update calls to finish execution on the object. After this, the get state request is executed on the application object: the state of the object

(as left by the last update operation executed before get state) is read (phases

  ¢ 3 and 4 above, denoted by and in Figure 7.2). Finally, the state that was read by get state is recorded at the Logging server, after all queuing calls at that server. Thus, before “exiting the system”, the completed get state request again spends some time in the queue of the

Logging server and on it (phases 5 and 6 above, denoted by the second ¡

and ¢ ¡ in Figure 7.2). The six phases constitute the whole checkpointing procedure, and are repeated

¦ (the checkpointing interval) time units after the last phase was completed. See

Figure 7.2 for a visualization of this procedure. Note that failures can appear at any



£ ¢ time between instance £ and of checkpointing. The goal of our work is to support the engineer that uses such an infrastructure by providing the optimal value of the average checkpointing interval, under certain 7.3 Basic model 115

server 1 server 2 λ Client req. 1 λ +λ 1 2 λ Logging Application Checkp. req. 2 2λ λ λ +λ λ +λ (L) (A) 1 2 Reply logging 1 2 µ µ 1 2

Figure 7.3: Servers and queues

simplifying assumptions, after inputting the other parameters (e.g. average client request arrival rate, average failure rate, average service rates) of the system. The optimal checkpointing interval is the value that maximizes the average time the system is available for client request processing, or that minimizes the average response time of client requests. The two values, as the experiments will show, are different, as the optimization criteria are not overlapping. Also, some of the assumptions made in the model are common for both optimization settings and others are different.

7.3 Basic model

This section will present common modelling aspects when considering both opti- mization criteria: average availability and average response time.

Throughout this and the next chapter, capital letters (with or without subscript,

 ¢ e.g. , ) designate random variables.

In the approximating Markov model below, ¦ is a random variable with generic

¡ 

1 

¦  distribution with § . will be expressed as a function of different given parameters and is chosen such that the system spends the least time in failover operations, as well as checkpointing, or it provides an optimal average response time when processing client requests. The aim of the models is to express the average time the system is available for

client request processing, respectively the average response time, as a function of

  ¥ the “unknown” quantity (or some intermediate ¥ , such that is a function of ), and the “known” system parameters. Once this function is found, the main goal of finding the  that maximizes, respectively minimizes it, can be achieved. Following the description in the last section, we model the server side of the ap- plication as a set of two servers (in terms of queuing theory). The first server where

1

Same as the random variable ¡ from Vaidya’s work [153]. 116 Chapter 7. Computing the optimal checkpointing interval client requests arrive to be served first (for logging purposes) is the Logging server (server 1 in Figure 7.3). The second server at which requests arrive (to perform the actual computations) is the Application server (server 2 in Figure 7.3). The two servers form a pipeline. Customers of server 1 are the requests that arrive originally from the client, as well as the checkpointing re- quests. These customers, when departing from server 1, become customers of server 2. Customers of server 2 when departing, return to server 1 (see the “feedback” arrow in Figure 7.3). From server 1 they depart as “reply to client” (see the arrow in 7.3 that corresponds to that arrow in the Figure 7.1).

7.4 Modelling assumptions

This section will present our assumptions for the basic model. Assumptions specific to each model will be presented in the sections dedicated to each of them. In line with earlier works [148, 63, 164, 87, 30, 95] we assume that a failure is detected as soon as it occurs. We also assume that no failure occurs during a failover [164, 63, 95, 30]. We assume that the first failure arrives after the first checkpoint operation was completed. The average arrival rate of client requests is assumed to be higher than the average arrival rate of checkpointing requests. We will approximate the interarrival time between two consecutive checkpoint-

ing requests with independent identically distributed random variables, with expo-

' ¡ nential distribution with average . As a consequence, the random variable ¦ is not deterministic2. A simplifying assumption we make is that the probability distribution of ser- vice times on the two servers does not depend on the type of request the server is processing. This means, for example, that the probability distribution for the ser- vice time of a call information logging request on server 1 is the same as the distribution for the service time of a reply logging request. Also, the state transfer

part of the failover time is assumed to be a constant ¢ . Related to queue analysis, we assume that no infinite queues build up at any of the servers. Client request interarrival times are independent identically distributed vari- ables with exponential distribution. Service times on the two servers and call replay time are also exponentially distributed. Failure interarrival time distribution is also exponential, as in [164, 121, 153, 87, 63, 27].

2 ¦ §

£ ¤ ¥ However, it will be shown in Chapter 8, that the optimal value obtained for ¢ , works rather well in a simulation setting, when a deterministic ¦ is chosen. 7.5 Queuing analysis 117

Finally, in the current work we assume that all client requests arriving at server 1, will proceed to the queue of server 2 and will be processed on server 2. In other words, no client request arriving at server 1 has been resent by the client, i.e. no requests find a reply in the call log.

7.5 Queuing analysis

For both optimization cases we will analyze two queues: the one at server 1, and the one at server 2. What we need, are the distributions of interarrival times of customers of the two servers, as well as distributions of their service times. Since we assume that all “external” customer interarrival time distributions as well as service time distributions are exponential (see Section 7.4), all “internal” customer interarrival time distributions are also exponential. By “internal” customers we

mean, customers of server 2 and customers of server 1 that are fed back ' from server 2. Let the average arrival rate of client requests be . Since all

interarrival time and service time distributions for the two queues are exponentially  distributed, the two queuing systems are ¡ ¢ ¡ ¢ , and, besides, they form a Jackson network [3]. The average arrival rate for customers at server 1 can be computed as fol- lows. The queue is composed of:

“external” customers, given by normal (client) and checkpointing (infrastruc-

 ' ¢ ( ture) requests: these have average arrival rate

“internal” customers, given by “feedback” from server 2: these have av-

erage arrival rate (equal to the average departure rate from server 2)

Thus, the total average arrival rate of customers at server 1 is . The queue at server 2 contains customers that departed from server 1 that will eventually return to server 1. Hence, the average arrival rate of cus-

tomers at server 2 is .

Figure 7.3 shows the servers, as well as the customer interarrival and service

¤ ¥ $

' " time distributions at each of them. The text under each box (e.g. £ ) shows

the service time distribution on the respective server. Thus, the average service rates

' £ ( on the two servers are £ and , respectively. The figure does not show the average

request replay rate (this will be denoted by £ ¦ ). The reason for considering the average replay time different from the average service time on the Application 118 Chapter 7. Computing the optimal checkpointing interval server is that the replay of calls is assumed not to involve any overhead that the normal request processing would involve. Replay of requests is performed on the server object locally, in a batch, not incurring middleware overhead. As will be seen in the numerical studies section, the value of the replay rate will be chosen larger than the service rate on the Application server.

The consequence of the assumption “no infinite queues build up at the servers”

' ( ( '

£ £  £  £ is that the following relations hold between , , and : and .

The consequence of the assumptions “average rate of incoming client request is

(  '

higher than the average rate of checkpointing request arrival” is: . 

The average time a customer waits in the ( ¡ ¢ ¡ ¢ ) queues of server 1 and

server 2, is given by classical queuing theory formulas [3], and is described by

¡ ¡

¡ ¥ ¡ ¥

( (

¡ ¢ ¡ ¡ ¡ ¢



¡

§  § 

¤ ¥ ¡¤ ¥ ¥ ¤ ¡ ¤ ¥ (

( and , respectively. Some variables used in our optimization analysis are shown in figures 7.3 and

7.4 and summarized in Table 7.1. The average values of the random variables £

and ¤ will be computed in the respective sections dedicated to the two optimization criteria. Besides, new variables specific to each model will be introduced there.

7.6 Optimal checkpointing for maximum availability

The goal of this modelling is to obtain the expression for the average time that the system is available for client request processing. To reach our goal let us first consider the failover time random variable.

Without loss of generality, we will perform our analysis considering a failure



$ ¥

£ ¦

occurring between the time when the £ ( ) checkpointing request left the

$ ¥

system (see arrow “to client/infrastructure” in Figure 7.1) , i.e. the £ state record-

 §

$ ¢

ing happened, and the time when the £ state recording happened, i.e. the



§

$ ¢

£ checkpointing request left the system. Let the two points at which these

' ¨ ©

state recordings take place be ¨ © and respectively (see Figure 7.4). A failure

¨ ¨ ' © may occur anywhere between © and , including any of the intervals corre-

sponding to the phases of the checkpointing procedure. Let ¨ be the random vari-

¡

'

¡

¨ 

able denoting the interarrival time between checkpoints ( § ). For reasons of

¨ ' £ ¨  ¨ ©

symmetry we have that © (in terms of random variable distributions,



¡ ¡

¢

it does not matter if T is written as the sum of ¦ and the first group , , , etc.,

¨ ' £ ¨  ¨ ¨ ¨

© ¤ ¤ or the second group; in other words, © , but and have identical

probability distributions). In sequel we will only refer to ¨ . In this section, we will assume that after each failure (and the corresponding failover) the system is started out with no backlog. Thus, no client requests are

7.6 Optimal checkpointing for maximum availability 119

¡ ¡ ¤ ¥ ¦ § ¨ © ¦ §

¢ £

¨ ¨ ¨  ¦ §

wait time in the queue of the Logging server ¦ §

§ ¨ © §

  ¤ ¥

¢ £



§ ¨ § wait time in the queue of the Application

server

 

¡ ¡

¤ ¥ 

¢ £

service (processing) time on the Logging server ¨

 

  ¤ ¥



¢ £

service (processing) time on the Application

server

  ¤ ¥ 

¢ £

time between two checkpoint request arrivals §

    distance between the moment of failure and the mo-

ment of the last state recording (   )

  ¤ ¥  ¢ £

time between the end of checkpoint  and beginning

 

of checkpoint 

    ¤ ¥ 

¦

¢ £ 

¨

time the checkpointing request spends on the two

  

 ¥ ¡  ¡

 

servers (  )

    ¤ ¥



¦

¢ £ 

¨ ¨ ¨

§ ¦ § ¦ §

time the checkpointing request spends in the system §

 ¡  ¡

 ¥  

 

(  )

    ¤ ¥  ¢ £

call replay time 

state transfer time

number of call records in the log that have to be re- !

played at failover " failover time !

Table 7.1: Variables used in the models

st j e 1 u 1 + q t j t j e n r r in 1 i t e o t j+ o t j v p n p s n in r k s i k d i o e c d o s c n o p e n p n e e p s k s h e k i h k n c s c c g c c i e e e e e g h v h b h e c a c T c b le Tj j+1 W S W S W S Y L L A A L L WL SL WA SA WL SL

failure P time T

Figure 7.4: Relation between different random variables

queued up at any of the two servers. Table 7.2 summarizes the random variables specific to the basic model used when maximizing average availability. The average value of all the variables will be computed in the rest of this section.

120 Chapter 7. Computing the optimal checkpointing interval

¡

 

" ¤ ¤ ¢   ¤   ¤ ¥  ¤

! !

¢ £ ¢ £ ¢ £ £ ¢ £ time the system is available for client request process- ¢

ing between the occurrence of two consecutive failures

 ¤ ¥ 



¢ £ £

time between two failures §

¦ § ¨ ¨

¡ ¡

¤ ¥ ¤ ¥ ¤ ¥

¦ ¨ £

number of checkpoint requests that arrive and are exe- ¢ ¤ ¥ cuted between two failure occurrences

Table 7.2: Availability random variables

The failover time is dependent on the state transfer time (constant ¢ ) and the number of requests to be replayed. In other words, the failover time random vari- able depends partly on the number of call records in the call log at the moment

of failure. This number (denoted by the random variable £ in Table 7.1) is the same as that of client requests that managed to record their call information in the log between the moment the checkpointing request left server 1 on its way to

server 2 and the moment of failure (from Figure 7.4, the length of this time

 

¢ ¢ ¢ ¡ ¢ ¢ ¡ ¢ © interval can be computed as ). The calls that recorded their information in the log during this interval, are exactly the ones that arrived at the

application server (server 2) during the interval. The average rate of arrival of

¡

'

§£ these calls at server 2 is . Hence, to obtain , we can use Little’s formula

[3]:

¡ ¡

 

¡ ¡ '

§ ¢ ¢ ¢ ¢ ¢ ¢ © ©  ¨ £ 

§ ,where

¡

 

§ ¢ ¢ ¢ ¡ ¢ ¢ ¡ ¢ © ©  ¨ 

¡ ¡ ¡ ¡ ¡

 

§ ¢ §¢ ¢ § ¡ ¢ §¢ ¡ ¢ §© ©  ¨  (1)

Using well known properties of conditional expectation described in Section

¡

$

 

© ¨ " ("

2.5.1, we obtain (remember that the s of and are  and respectively):

7.6 Optimal checkpointing for maximum availability 121

¡

 

$

 

 

   

 $ " (" ¡

1 

 

 ¡ 

© § 

¡ 1

 §© ©  ¨  

¡

 ¡ 

¦ §©  ¨

§1

¢

$

¡

  £

$



 

  "  ("



 



¡

¡

$ $

' '





¥ ¥

 £ "   ¢ ¢ ("

 ¤ ¤

 

¡

¡ ¡

¡ ¡ ¡

¢ ¢

¡ $ $ ¡ $

  

  

 ¢   £ " " £  ("

  

  

 

¡

¡ ¡

'

¡

¡ ¢ ¡ ¢

£ ¢ £

¡ ¡

¢ (

 ¡  

(

¡

¢ (



 (2)

¢ (

¡

£

Now, by substituting values from Table 7.1 in (1) and by using (2), we obtain § :

¢

  

' ( ' (

¢ ¢

£

¡

§£  ' ¢ ¢ ¢ ¢

£ ( £ ( £ ' £ ( £ ( £ ' £ ' £ ' £ ( £ ' ¢ (

¢

  

£

' ¢ ¢

 (3)

( ' ( ' ' ( (

£ £ £ £ £ £ ¢

¡

£

The time for call replay is given by multiplying § with the average call re-

¡

¨ ¦ play time ( § ). By substituting values from Table 7.1 and using (3) the average

failover time can be computed as well.

¡ ¡ ¡

§¤  §¨ ¦ ¡ §£ ¢ ¢ 

¢

  

'

£

 ¢ ¢ ¢

¢ (4)

¦ ( ' ( ' ' ( (

£ £ £ £ £ £ £ ¢

The formula for the average system availability between two consecutive failures

122 Chapter 7. Computing the optimal checkpointing interval

¡ ¡



¨ §

( § ) can be computed from the formula in Table 7.2, by substituting :

¡ ¡

§¨ £ §¤

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡



§¨  §¨ £ §¤ £ § ¡ §¨ ¡  §¨ £ §¤ £ ¡ §¨ ¡ 

¡

§¨

¢

¡

§¨ ¡

¡ ¡ 

£

 §¨ £ §¤ £

¡

§¨

¡ ¡

¨ §¨ ¡

Thus, by using (4), and by replacing § with its value from Table 7.2, and

¡ ¡



¨ §¨

and § with values from Table 7.1, we obtain for :

¢ ¢

   

'

¡ £ £



§¨  £ ¢ ¢ £ ¢ ¡

£ ¦ £ ( £ ' £ ( £ ' £ ' £ ( ¢ (

¢ ¢





£ £

¢ £ (

¡ (5)

( '

£ £

¡





¨

As mentioned earlier, our goal is to obtain the that maximizes § . However,

¡





¨ ( (

we obtained § as a function of . The relation between and is given by

¡



¦ 

the following (recall that § , and see Figure 7.4):

¡ ¡ ¡ ¡ ¡ ¡

 

¨  §¦ ¢ §¢ ¡ ¢ § ¡ ¢ §¢ ¢ §

§ (6)

¡ ¡ ¡ ¡ ¡ ¡

 

¡ ¡

¨ §¦ §¢ § §¢ § By replacing in (6), the values for § , , , , , and from

Table 7.1, we obtain:

 

' ¢ ( ' ¢ (



 ¢ ¢ ¢ ¢

( ' ' ' ' ( ( ( ( ' (

£ £ £ £ £ £ £ £ £ £

which leads to

 



¢ ¢

 (7)

( ( ¢ ' £ £ ( ( ¢ ' £ £ '

¡



( §¨

To obtain the value of , that maximizes (in (5)) one has first to study the

¡



¨ ( (

concavity property of the function § in terms of (in a shorter form, ). (

Then, for in the intervals where the function is concave (if it is somewhere) solve

(  ¡ ( (

the equation ¤ . Finally the that gives the “absolute” maximum

¡



¨

( § ) should be chosen.

(

It is visible from the formula in (5) that is a rational function and it ( is therefore continuous on regions between its poles (values of that are roots 7.6 Optimal checkpointing for maximum availability 123

of the denominator). Thus, the concavity study can be conducted only in the re-

gions between two poles where the function is continuous. The poles are: £ ,

¤ ¥ ¥

(

£ ( £ '

and ( . As we will assume that the value of the average service

rate on the Logging server is much larger than that on the Application

¤ ¥ ¤ ¥ ¥

(

¡

( '

%% £ £ 

server ¤

( ), we will have the relation ( . As the values of

¡



( ¨

(that maximize the value of § ) interesting to us are positive, and even more,

¤ ¥ ¥

(

' ( '

£ £

smaller than , the poles we will consider are only and ( . Thus,

£ ( £ '

the interesting intervals where to study the functions concavity are ¡ 

¥ ¥

¤

(

£ ( £ '  £ ( £ ' % ¡ £ ( % ' £ ' %% £ ( £ ' %% '

and ( . As ( ), and ( ,

' % ' £ ' £ ' % '

and thus £ ), we have that , and as a consequence, the max-

( (  '

imizing cannot be in the third interval, as we must have . Of course, it

( ' ( ' ' '

£ £ £ £  £ £ is possible that on each of the intervals ¡  and the

function is neither convex, nor concave.

( ¤ ¤ (

To study the concavity of one must find the second derivative and

(

study its sign. Below we will give the first and second derivatives of :

¢ ¢

 

£ £

¤

(

 ¢ £ ¢ ¢

£ ( £ '

¡

¡

¥

¤ ¤

( ¥ ¥ ¥ ¥

( ( (

¡ ¡ ¡ ¡

£ £ £ £ ¢

¤ ¥ ¤ ¤ ¥ ¤ ¤ ¤ ¥ ¤ ¤ ¥

'

¢ ¢

¢ (8)

( ( ( ¢

¦ ( ( ' ( ' ' (

£ £ £ ¢ £ £ ¢ ¢

¡

¡

¥

¤ ¤

( ¥ ¥ ¥ ¥

( ( (

¡ ¡ ¡ ¡

£ £ £ £ ¢

¤ ¥ ¤ ¤ ¥ ¤ ¤ ¤ ¥ ¤ ¤ ¥

'

¤ ¤

¢ ¢ (  £

(9)

¦ ¦ ¦ ¢

£ ¦ ( £ £ ( ¢ ' ( £ £ ' ¢ ' ( ¢

(

If we would further refine the two expressions from above, both ¤ and

( (

¤ ¤ ¤

would turn out to be rational functions. The nominator in is a poly-

( £ '

nomial with degree four (it has degree two only when the two poles £ and

¥

¤

£ ' ¤ ¤ (

( are equal) (a), and the nominator in is a polynomial with degree

¤ ¥

£ ( £ ' £ '

six (it has degree three only when the two poles and ( are equal)

(b). The two poles can never be equal as, an opposite situation would lead to the

¤ ¥

( ' (

£  £ %% £

equality ( , and we assumed that (c).

( Now, in order to study the sign of ¤ ¤ one needs to solve - according to (a)

and (c), - a sextic equation, to find the roots of the nominator polynomial. Fur-

(  ¡ ther, in order to solve the equation ¤ one has to solve a quartic equation (according to (b) and (c)). According to Ferrari the quartic equation has analytic 124 Chapter 7. Computing the optimal checkpointing interval solutions[19]. Abel’s impossibility result [1] tells that the sextic equation does not have analytic solutions. In any case (even for the quartic equation), since comput- ing those solutions analytically is very tedious, we chose to use the tool Mathe- matica [162] that can perform numerical maximization of functions, given certain constraints. In this way we skip both the analysis of the concavity of the function (to determine the region where the maximum can be found) and the solving of the equation. We simply specify the function to maximize, the constraints that have to

be satisfied by the maximizing solution, and let Mathematica do the operations for

 (  '

us. One of our constraints was that ¡ . In order for Mathematica to be

( (

able to conduct the maximization of ) in and return the maximizing value

( ' £ ' £ (

of , we had to give numerical values to all other system parameters: , , ,

¦

¢ £ , and .

Independently of how it is obtained (analytically and then replacing parameters

¡



( ¨

with numbers, or numerically), the value of that maximizes § is a function

¡





¨ ' £ ' £ ( £ ¦ of all the other parameters appearing in § ( , , , , ). The value of that

maximizes the average availability time, can then be computed by substituting the ( obtained value for in equation (7).

7.7 Optimal checkpointing for minimum response time

In this section we will present the modelling and analysis aspects involved in the computation of the checkpointing interval when the optimization criterion is the average response time of a request. The queuing model used here is very similar to the one used when optimizing average availability. However, an accurate measure

of response time turns out to benefit from modelling the backlog after failures.

 ( Since the relation between (the checkpointing interval) and (average check-

point arrival rate) remains the same as in equation (7), we will, at the end of our ( analysis here, express the average response time as a function of the unknown . All assumptions listed in Section 7.4 hold for this model as well. However, some additional assumptions are necessary.

As shown in Figure 7.5 due to the extension of the basic mathematical model,

the time line ¡  will be sliced in groups of failover, backlog processing and equilibrium intervals. Equilibrium corresponds to normal processing, i.e. when the

system is not executing failover or backlog processing. The backlog processing in

' %% £ ( the interval during which the server system (mainly server 2, as £ ) has to process all requests queued up during failover. Note that failures for which the failover time is analyzed and included explicitly in the final formula of average re- 7.7 Optimal checkpointing for minimum response time 125

F B Equilibrium F B Equilibrium

TF time

Figure 7.5: Equilibrium, failover and backlog processing

sponse time, are those that arise following the end of an equilibrium interval. These are shown in Figure 7.5. Even though failures can occur also during backlog pro- cessing intervals, implying new failovers, those failover times are incorporated in the expression of the backlog processing time. Thus, they will not appear explicitly in the final formula for response time that we provide.

7.7.1 New assumptions

Two different average failure arrival rates are considered, depending on whether the

failure arrives during a backlog processing interval, or during equilibrium intervals.

The average rate of failures during backlog processing , will be larger than the

¡

¤

 “equilibrium” failure rate , with . During this time server 2 is busier with request processing. We assume that reply logging “customers” of server 1 have higher priority than other customers (client or checkpointing requests logging call information), and the special “log reading customer” that appears only at the beginning of a failover. This assumption affects the failover time that now consists of (1) state

transfer time (denoted by the constant ¢ ), (2) waiting for all reply logging customers present in server 1’s queue to be served, before the call information log can be read for the replay, (3) call replay time for requests prior to the failure. Also, it has a consequence on the way of computing the average number of calls that are present in the call log at the moment of replay. In this work we do not treat the heavy traffic case separately. We obtain one formula for the average response time and do our analysis based on that, even for large values of the average request arrival rate. The separate treatment of the latter case is subject of future investigation following the work by Lehoczky about real- time queuing theory [93]. 126 Chapter 7. Computing the optimal checkpointing interval

1 j+ t j t n 1 in i t j+ o o n p s t j p i k n k s o c d i c d p s e n o e n k in h e p s h e c g c k n c e e c i h b e g c h e c b Tj Tj+1 F Y F

P time failure T´

T

Figure 7.6: Failures and checkpointing

7.7.2 Minimizing average response time

The average response time is the average time a request spends in the system given by server 1 and server 2, from its arrival in the queue of server 1 as call logging request, until it leaves server 1 as reply to the client (see Figure 7.1). The system’s life cycle can be divided in two types of phases: failover when the response flow is stopped, and equilibrium and backlog processing when the system is processing requests. Hence, the average response time is basically given by two parts: the average time spent when actual processing of the request takes place plus the average extra time spent in the queue of some server due to failover. Now, the latter fragment will be encountered by requests that arrive in the queue of server 1 during a failover, or during equilibrium just before a failure, so that their processing is delayed by the failover activities. Without loss of generality, we will consider two particular points as delimiters

of the interval, in which a failure marking the end of an equilibrium interval oc-

¨ ¨ ' ©

curs, denoted by © and (see Figure 7.6). As in our reasoning in Section 7.6



$ ¥

£ ¦

these are the points when the £ ( ) checkpointing request left the system, i.e.

 §

$ ¥ $

£ ¢ the £ state recording happened, and the time when the state recording happened, respectively (see arrow “to client/infrastructure” in Figure 7.1). Note that, although the picture does not reflect such a situation, our assumptions do not exclude the possibility of a failure occurring during any of the six phases of the checkpointing procedure (see Section 7.2 for a reminder of what these six

phases are and Figure 7.4 for a graphical representation). Remember from Section

¡

'

¡

§¨  ¨ ' £ ¨  ¨ © 7.6 that © (where ). Table 7.3 summarizes the random variables specific to the model used when minimizing average response time. The average value of all the variables will be computed in the rest of this section.

7.7 Optimal checkpointing for minimum response time 127

¦ response time of a client request -

 ¡

time between the end of failover ¢ and the beginning -

 

of failover ¢

 number of reply logging customers in the queue of -

Logging server

number of failures during the backlog processing in- - terval £ time for processing all built up calls at the queue of - the Logging server during failover (backlog pro-

cessing time) ¤ number of requests that are processed during backlog - processing

Table 7.3: Response time (random) variables

By detailing the earlier intuitive description, we can formulate the expressions for the lower and upper bound on the value of the average response time. Both have four terms: two of them correspond to the average times spent on the servers during equilibrium and backlog processing, respectively, while the other two correspond to the average extra times spent by requests arrived during failover and just before it, respectively. We enumerate them below:

Term 1 expresses the average processing and wait time spent in the system by requests that arrive and leave during equilibrium, weighted by the fraction of time the system is in equilibrium while enough time is left for a request to finish execution on server 2

Term 2 expresses the average processing and wait time spent in the system by a request during backlog processing (this request either arrived during a failover inside or outside the backlog processing, or was already in the queue at server 1 when the failure occurred), weighted by the fraction of time the system spends in backlog processing

Term 3 expresses the average processing, wait and extra wait time spent by requests that arrive while the system is executing a failover, weighted by the fraction of time the system is executing failover

Term 4 expresses the average processing, wait and extra wait time spent by requests that arrived during equilibrium, but did not have enough time to 128 Chapter 7. Computing the optimal checkpointing interval

complete their execution on server 2, weighted by the fraction of time left of equilibrium till the failure occurrence, during which there is no time to leave the failed server 2

The lower bound expression is obtained when assuming that the number of failures that occur during backlog processing is zero. The upper bound is given when assuming that at least one failure occurs during backlog processing, so that at least one new buildup has to be handled until the backlog processing period is over.

The formula for the lower bound on the average response time (each of the four

terms appearing on a separate line) is given below:

¡

¡

#

§ ¡ ¢ 

¡ ¡

' ' ( '

¡ ¡ ¡ ¡ ¡ ¡

¥ ¥ ¥

# #

§¨ ¡ ¢ £ §¦ ¡ ¢ £ ¢ ¢

¤ ¥ ¥ ¤ ¥ ¤ ¥ ¥ ¤ ¥

( ( ( (

¤ ¤ ¤

 ¢

¡ ¡

#

§¨ ¡ ¢ ¢ §¤

£ ¤

¡

'

¥ ¦ § © ¨

¡

¥

#

§¦ ¡ ¢ § ¨ ¢

¤ ¤ ¥

(

¤

¢ ¢

¡ ¡

#

¡ ¢

§¨ ¢ §¤

¤

£

¡

'

¥ ¦ § © ¨

¡

¥

§¤ ¢ ¢

§ ¨ § ¨

¤ ¤ ¥

(

¤

¢ ¢

¡ ¡

#

§¨ ¡ ¢ ¢ §¤

¡

' ' '

¡ ¡ ¡ ¡

¥ ¥

¢ §¤ ¢

¤ ¥ ¥ ¤ ¥ ¤

( (

¤ ¤

¢ ¡

¡ (10)

#

¡ ¢

§¨ ¢ §¤

The formula for the upper bound of the average response time (each of the four terms appearing on a separate line) is:

7.7 Optimal checkpointing for minimum response time 129

¡

¡

#

§ ¡ ¡ 

¡ ¡

' ' ( '

¡ ¡ ¡ ¡ ¡ ¡

¥ ¥ ¥

# #

§¨ £ §¦ £ ¢ ¢

¡ ¡ ¡ ¡

¤ ¥ ¥ ¤ ¥ ¤ ¥ ¥ ¤ ¥

( ( ( (

¤ ¤ ¤

 ¢

¡ ¡

#

§¨ ¢ §¤

¡ ¡

£ ¤

¡

'

¢ £ £ © ¨

¡

¥

#

§ ¨

¢ §¦ ¡ ¡

¥

¤ ¤

¤

¢ ¢

¡ ¡

#

§¨ ¢ §¤

¡ ¡

¤

£

¡

'

¢ £ £ © ¨

¡

¥

§ ¨

§¤ § ¨ ¢ ¢

¤ ¤ ¥

(

¤

¢ ¢

¡ ¡

#

§¨ ¡ ¡ ¢ §¤

¡

' ' '

¡ ¡ ¡ ¡

¥ ¥

¢ §¤ ¢

¤ ¥ ¥ ¤ ¥ ¤

( (

¤ ¤

¢ ¡

¡ (11)

#

§¨ ¢ §¤ ¡ ¡

As shown in the description above, we need to compute the following val-

¡

¤ ues: the average failover time ( § ), the lower and upper bound for the average

number of requests that have to be processed until the equilibrium is reached after

¡ ¡

 

# #

§£ ¡ ¢ §£

the failover ( and ¡ ¡ ), the lower and upper bound on the average

¡ ¡

# #

¡ ¢

¦ §¦ ¡ ¡ backlog processing time ( § and ), and the corresponding expres-

sions of the average time from the end of a failover until the moment of the next

¡ ¡

# #

¡ ¢

¨ §¨ ¡ ¡ failure ( § and ). To compute the average failover time let us recall our reasoning about failure occurrences and replaying of calls from Section 7.6. In the present context we have

again that the number of call records to replay (denoted by the random variable £ in Table 7.1) is the same as that of client requests that managed to record their call information in the log between the moment the checkpointing request left server 1 on its way to server 2 and the moment of failure. However, here we assume that reply logging customers of server 1 have higher priority than call logging customers. Therefore, after the checkpointing request read the state of the appli- cation and returned in the queue of server 1 to log its reply (the state) no call logging customer will be processed anymore until the moment the checkpointing request stores the state. Thus, in the present setting, the number of client requests that manage to log their call information, i.e. arrive in the queue of server 2 af-

ter the checkpointing request arrived there and before the failure occurred is equal

 

¢ ¢ ¢ © to the number of arrivals at server 2 during the interval . By using Little’s formula we can compute easily the average number of calls to replay (de- 130 Chapter 7. Computing the optimal checkpointing interval

noted by the random variable £ in Table 7.1), as the number of arrivals at server

 

¢ ¢ ¢ © ©  ¨

2 during the interval , where .

¡ ¡ ¡ ¡ ¡

   

§£  ' § ¢ ¢ ¢ © ©  ¨  § ¢ §¢ ¢ §© ©  ¨

¡

'

¡

§© ©  ¨ 

Using the result of (2) from Section 7.6, i.e. , and the values

¡ ¡

 

§¢

from Table 7.1 for § and , we obtain:

¢

 

' ¢ (

¡

£

§£  ' ¡ ¢ ¢

( ( ' ( ( (

£ £ £ £ £ ¢

¢

 

£

'

 ¡

¢ (12)

£ ( £ ' £ ( ( ¢

¡ ¡

£ ¡ §¨ ¦ The failover time includes, besides the call replay time (computed as § )

and the state transfer time ( ¢ ), the time taken to process on server 1, all re-

ply logging requests present in its queue. The average number of reply logging

¡

£ ¦ customers in the queue of server 1, § is computed using classical average

computing formulas for discrete random variables:



¡

¦ ¦

§£  . ¦ §£  .

 + -

The probability of the queue containing . reply logging customers at a point in time

is computed with classical queuing theory formulas (where we can approximate the '

arrival rate with ):

'



+

¦ §£ ¦  .  £ ¡ ' ¡ ¡ ' 

' , where

£ '

It follows that





¡ '

¡ 

+

¦ '

§£  . £ ¡ ¡  



'

£ ¡ ' £ ' £ '

 + -

By replacing values from Table 7.1 the average failover time is

¡ ¡ ¡ ¡ ¡

¦ ¦ ¡

§¤  §£ §¨ ¢ §£ §¢ ¢ ¢

' ' '

¢ ¢ ¢ ¢

 (13)

£ ¦ ¢ ( £ ¦ £ ( £ ' £ ( £ ' £ ' £ ' 7.7 Optimal checkpointing for minimum response time 131

To compute the upper bound for the average number of requests to be processed

¡



#

§£ during a backlog processing interval ( ¡ ¡ ) our reasoning is as follows: If no failure occurs during the backlog processing interval, the average number of

requests processed is equal to the initial backlog size that gives the lower bound

¡



#

¡ ¢

£ § . This lower bound is equal to the average number of requests that arrive

in the queue of server 1 during a failover.

¡



#

¡ ¢

£

§ is computed by using Little’s formula (considering as interval for the

¡

¤

arrivals, the average failover time § ) and then (13):

¡ ¡



#

¡ ¢

§£  §¤ 

¢

' ' '

£

¢ ¢ ¢ ¢

 (14)

£ ¦ ¢ ( £ ¦ £ ( £ ' £ ( £ ' £ ' £ '

However, during backlog processing (i.e. before the processing of the initial

¤ ¢

¡

¡ ¡



#

£ ¡ ¢  § requests is finished), failures can occur at the average rate .

After each failover, while the system performs the backlog processing of the

¡ ¡

 

# #

£ ¡ ¢ §£ ¡ ¢ § accumulated requests, new failures occur, and thus “new sets” of

requests accumulate. An upper bound for the average number of requests that are

¡



#

£ §£ present in the queue can be calculated by using the formula (with ¡ ¡

given below):

¡ ¡ ¡

  

# # #

§£  §£ ¡ ¢ ¢ £ §£ ¡ ¡

¡ ¡ (15)

¡



#

£ §£ Intuitively, ¡ ¡ can be seen as the average number of requests (from

the backlog) that are still unprocessed when a failure occurs at time  . An ap-

¡



#

£ §£ proximation of ¡ ¡ can be computed as follows. Note that we will only

focus on the requests accumulated in the queue of server 2, the bottleneck, as



' %% £ ( 

£ . Given a time moment , let the number of requests waiting in the

 



queue of server 2 at , be  . The number of requests in the queue of

 

 ¢ ¤  ¢ ¤ 

server 2 waiting to be processed at time (  ) when no fail-

 

 ¢ ¤ 

ures occur between  and , is approximately equal to the number of re-  quests at time  minus the number of requests that have left after being processed

on the server, plus the number of requests that arrived in the queue. Considering

( £

that the rate of arriving is , while the rate of departing is , we have thus:

  

 ¢ ¤  ¥  ¢ ¤  £ £ ( ¤    £ £ ( £ ¤  

The value of ¤ for which the requests in the backlog have “vanished” is the

¢

¡$ ¦

)



¡

¤    ¢ ¤  value for which becomes zero, i.e. ¤ .

132 Chapter 7. Computing the optimal checkpointing interval

¤

£

¢ £ £ © ¨

¡

§ ¨

¡ Now, at any time between and ¤ , failures can occur. Thus, the average number of requests still remaining to be processed when a new failure occurs is

given by the formula:

¤ ¤

£

¢ £ £ © ¨

¡ ¢

¡ ¦ §

¥

¨ $

¡ ¡



 

# #

£ §£  §£ £ £ ( £  " 

¡ ¡ ¡ ¡



¢

¤ ¤

£

¢ £ £ © ¨

¡ ¢

¨

¡ ¦ §

¡ ¡ 

£

¥

 

# #

£ §£  §£ ¢ £ "

¡ ¡ ¡ ¡

¢

¤

¤

£

¢ £ £ © ¨

¡ ¢

¨

¦ §

¡

£ ( £ £ ( £

¡ £

¥



#

§£ ¡ ¡ ¢ £ ¢ "

§

¤ ¤

£

¢ £ £ © ¨

¡ ¢

©

¡ ¦ §

£ ( £ £ ( £

¡

¥



#

 §£ £ ¢ "

¡ ¡ (16)

¡ ¡

 

# #

§£ ¡ ¢ §£ £

Now, by replacing with its expression from (14) and ¡ ¡

¡



#

§£

with its expression from (16) in (15), by removing ¡ ¡ from both sides of the

¡



#

§£

equality in (15), and rearranging the equation such that the unknown ( ¡ ¡ )

¡



#

£ ¡ ¡

ends up on the left side, we obtain the equation in § :

¡ §

¤

¤

£

¥

¢ £ £ © ¨

¡ ¢

§ ¡ ¦ §

£ ( £

¥

"

£ (

¢

£ ( £ ' ' '

£

 £ £ £ £ ¢

£ ( £ ¦ ¢ ( £ ¦ £ ( £ ' £ ( £ ' £ ' £ '

¡ ¢

¡ ¤

¡

If we further simplify with ¤ , we obtain the equation:

¥ ¥ ¥

§

¡

¤ ¤

£

¥

¡ ¡ ¡ ¥

¢ ¢ ¢

£ ( ¢ ¢ ¢ ¢

¢ £ £ © ¨

¥ ¥ ¥ ¥

¤ ¡ ¤ ¤ ¡ ¤ ¡ ¤

¡ ¢

§ ¦ §

¡



¤

¥

 £

" (17)

£ ( £

¡



#

§£

By solving equation (17) in ¡ ¡ we obtain:

¡



#

§£ 

¡ ¡

£ ( £ £ ( £

(18)

¥ ¥ ¥

£ (

¥

¡ ¢ ¢ ¡ ¡ ¢

( (

¢ ¢ ¢ ¢ £ £ £ £

¤ ¥ ¡¤ ¥ ¥ ¤ ¡ ¤ ¡¤ ¥

¤

¡



#

§£

¡ ¡ will be used to determine the upper bound on the average backlog

¡

#

§¦ processing time ( ¡ ¡ ). When no failures occur during backlog processing,

7.7 Optimal checkpointing for minimum response time 133

¡

#

¦ ¡ ¢

the average backlog processing time ( § ) is given by the average time taken

£ ¤

¡

¥ ¦ § © ¨

¡

#

§¦ ¡ ¢  § ¨ to process the initial backlog on server 2, i.e. ¤ . By taking

into account that failures indeed might occur during backlog processing, the upper

¡



#

£ ¡ ¢ bound on the average backlog processing time is given by the sum of § and the sum of all average failover times and all average times for processing accumu-

lated requests for further failures until the original backlog processing interval ends.

¡ ¡

#

§¦ §£ ¡ ¡ is given thus by the formula below, where is the average number of

failures occurring during backlog processing:

¢

¡



#

§£ ¡ ¢

¡ ¡ ¡ ¡

£

# #

§¦  §¦ ¡ ¢ ¢ §£ ¢ §¤

¡ ¡

£ (

¢

¡ ¡



#

§£ ¡ ¢ §¤

£

¡ ¡

 ¢ §£ ¢ §¤

£ ( £ (

¢ ¢

¡ ¡  £ £

§¤ ¢ §£ ¢

 (19)

£ ( £ (

We have that



¡

£  ¦ §£ ¦ .

§ (20)

+ - '

Let © denote the number of calls in the backlog when processing resumes

¡ ¡





#

§£ £ £ % ¡ § ¡

© ¡ ¡

after failure ( ). We can approximate . Let © be the

probability that a new failure occurs before the remaining © requests are processed,





¡  ¦ §£ ¦ £ ¢ £ ¦ £

i.e. © . It follows that

¢ £

¡ ¦ §

¥

§ §

¤ ¤

¢ £ ¢ £

© ©

¦ § ¥ ¦ § ¥

¡ ¡

¨

¡ $ ¡   ¡

¥ ¥

 

¡  § "   £ "  £ "

©

 

Let be the probability of a failure occurring during a time interval needed for

§

¤ ¤

£

¢ £ £ © ¨

£ ¤

¡ ¢

©

¦ §

¡

¡ 

¢ £ £ © ¨

¥



¡



#

§ ¨

§£ ¦  £ "

¡ ¡

requests to vanish from the backlog ( ¤ , i.e. .

§

¢ £

©

¡ ¦ § ¥

By Jensen’s inequality [71], because " is a convex function, we have that

§

¤

¢ £

¡ ¢

©

¡ ¦ §

© 

¥

 

 £ " ¡ ©

134 Chapter 7. Computing the optimal checkpointing interval

§£ ¦ .

Now we can compute ¦ . We have that

 

¦ §£ ¦ .  ¦ §£ ¦ . £ ¦ . £ ¦ §£ ¦ . £

 

 ¦ §£ ¦ . £ ¦ . £ ¦ §£ ¦ . £ £ ¦ . £ ¦ §£ ¦ . £

'



¦ §£ ¦ £ £ ¦ £ £ ¦ §£ ¦ ¡ 

- +

©

©

+

   



¡ + ' ¡ + ( ¡

 ¢ ¢ ¢

Therefore, the average number of failures occurring during backlog processing

is upper bounded:

§

¤ ¤

£



¢ £ £ © ¨

¡ ¢

©



¦ §

¡

¡ © 

+

¥



§£  " £  

(21)



£

+ - '

¡ ¡

£ §¤ Now, by replacing the upper bound on § (from (21)), and (from (13)) in

(19), we obtain:

¢

' ' '

¡ £

#

§¦  ¢ ¢ ¢ ¢ ¡

¡ ¡

£ ¦ ¢ ( £ ¦ £ ( £ ' £ ( £ ' £ ' £ '

¢ ¢ ¢

§

¢

¡ ¦ §

©

£ £ £

 

¥

¡ £ " ¢

¢ (22)

( (

£ £

The value of the average time till the next failure, corresponding to the situation

when the average number of failures during backlog processing is larger than zero

¡

#

§¨

( ¡ ¡ ) will be computed by using the two probability distributions for the

failure interarrival: exponential with rate (during equilibrium), respectively with

¡

¤



rate (during backlog processing).

¤



¥

¡ ¡ ¨ $ $



 

#

§¨   "  ¢  " 

¡ ¡





¤

  

¥

 

¨ ¡

£ ¦ ¢ " ¢ ¦ ¢ " 

7.7 Optimal checkpointing for minimum response time 135

¡

#

¦  §¦ Using Taylor expansion [103] as far as the second degree, about ¡ ¡ ,

we obtain:

¡

#

§¨

¡ ¡

¤ ¢ ¢

  

¥

¡  ¡  ¨ ¡

£ £

¢ £ £ © ¨ ¢ £ £ © ¨

# #

¢ §¦ ¢ " £ " §¦ ¢

¡ ¡ ¡ ¡

§ ¨ § ¨

¢ ¢

  

 ¡ ¨  ¡

£ £

¢ £ £ © ¨ ¢ £ £ © ¨

# #

 ¢ §¦ ¢ £ " " §¦ ¢

¡ ¡ ¡ ¡ § ¨

§ ¨ (23)

When the average number of failures during backlog processing is zero, the

¡

#

¨ ¡ ¢ value of the average time till the next failure ( § ) is simply computed by

subtracting the average failover time from the average time between two failures

'

( ). So,

¢



' ' '

¡

£

#

§¨ ¡ ¢  £ ¢ ¢ ¢ ¢

£ ¦ ¢ ( £ ¦ £ ( £ ' £ ( £ ' £ ' £ '

¡ ¡

¡ ¡

# #

§ § ¡ ¢ We will not expand further the expression of ¡ ¡ or as it would

yield a formula that is not followable.

¡

¡

#

§ The expression used to minimize the average response time is ¡ ¡ . Like

in the previous section where the optimization criterion was the average availabil-

¡

¡

#

§ (

ity, we will consider ¡ ¡ as a function in (the average checkpoint arrival

¡

¡

#

( §

rate), . Although we do not expand the expression of ¡ ¡ further (by

¡ ¡



# #

£ ¡ ¡ §¦ ¡ ¡ replacing § and with their expression from (18) and (22)), from

equations (18), (22) and (11) it is visible that the average response time can be

' ( ' ( ¦

£ £ ¢

written as a function of the seven parameters: £ , , , , , , and . Con- ( sidering, as in the previous section, all parameters beside , as so-called constants,

we obtain in only one unknown. It has to be mentioned that the relation between

 (

(checkpointing interval) and remains the same as that in equation (7):

 



¢ ¢ 

( ( ¢ ' £ £ ( ( ¢ ' £ £ '

 ( To obtain (and then ) that minimizes the average response time, we will use only the formula for the upper bound on the average response time. The lower

bound expression will be used only in expressing the constraint on the obtained (

value of that minimizes the expression in (11). One has to first study the con-

( (

vexity of the function . Then, for in the intervals where the function is

¤ (  ¡ convex solve the equation . 136 Chapter 7. Computing the optimal checkpointing interval

We will not further pursue the computation of the analytic formulas to obtain the minimal average response time. However, we use the tool Mathematica to

perform the minimization of the average response time. To gain interesting insights,

(

we also consider the average backlog processing time as a function ¡ and study ( the value of that minimizes it.

Independently of how it is obtained (analytically and then replacing parameters

¡

¡

( §

with numbers, or numerically), the value of that minimizes is a function

¡

¡

' ' ( ¦

 £ £ £ ¢ of all the other parameters appearing in § ( , , , , , ). The value of

 that minimizes the average response time, can be computed by substituting the ( obtained value for in equation (7).

7.8 Summary

Let us summarize the models obtained in this chapter in terms of the goals we

defined earlier:

¡



¨

obtaining the formulas for average availability ( § ) and the approximation

¡

¡



of the average response time ( § ), described as functions of the parameter

( (

that is the average checkpoint arrival rate; the functions are and

(

in sections 7.6 and 7.7 respectively

 ( studying the relation between the parameter and , the checkpointing in-

terval  finding a straightforward way to obtain the optimal given the optimization

criteria:

(

– if the optimization criterion is the average availability, maximize ,

 ( then replace the maximizing value of in the formula of

– if the optimization criterion is the average response time, minimize



( ( , then replace the minimizing value of in the formula of

One might ask why the computation of the maximum availability did not need a model to include backlog processing. The answer is that we do not consider the backlog processing to affect availability. During backlog processing, in fact, the server is available for client request processing.

In the next chapter we will present interesting relations between the above ( functions (e.g. that maximizes (minimizes) the average availability (average

7.8 Summary 137

¡ ¡



 ¡

¨ § ' £ (

response time), § , , , etc.) and some of the system parameters , , ,

' £ ¦ £ and . Besides, we will present the results obtained after simulating the basic mathematical model used for maximizing average availability. 138 Chapter 7. Computing the optimal checkpointing interval Chapter 8

Analysis of the optimization models

This chapter presents our findings when studying the different quantities involved in our analytical models of Chapter 7, as a function of the average client request arrival rate and average failure rate. It also presents results of our simulations devised to validate our mathematical model itself, as well as the approximation we made in connection to the checkpoint interarrival time. Some examples of the interesting findings presented in this chapter are:

the minimum response time does not present significant differences as the system is exposed to differing failure rates,

the optimal checkpoint arrival rate behaves similar independently of failure rate and of optimization criterion,

the availability of the system when the response time is minimal is not too far from the maximum availability; similar results were seen for the response time

the backlog processing model can be used to find the sensitivity of the maxi- mum average response time to load variations

This chapter has two main parts:

an availability / response time trade-off study based on replacing parameters with numbers in the formulas obtained in Chapter 7 (section 8.1); 140 Chapter 8. Analysis of the optimization models

simulation results that validate the basic mathematical model that was used to compute the checkpointing interval in the basic model (section 8.2)

Throughout the chapter the study of maximum availability is based on the basic model and the study of the minimum response time is based on the refined model with backlog processing.

8.1 Numerical studies

In this section we summarize the results of our numerical studies based on the two models used to derive the formulas in Section 7.6 and 7.7, using the tool Mathe- matica. The results are based on numerical solutions that allow us to plot interest- ing variables against variations in other parameters such as average client request arrival rate, failure rate and call replay rate. These studies provided a deeper under-

standing of the system modelled in the previous sections. '

In the series of studies that we will show below, £ (average service rate on ( server 1) and £ (average service rate on server 2) are fixed at values 50 and 5, respectively. Fixing them was reasonable since they are infrastructure re- spectively application related. With those two quantities fixed, it is meaningful to

study how the system handles different external conditions such as different aver-

' age client request rates ( ), and different average failure rates ( ). The choice of the two values (50 and 5) was based on the need to impose a large ratio between the average service rate on the Logging server (server 1) and the average

service rate on the Application server (server 2). Also, the call replay



£ ( rate £ ¦ will be fixed at the value , a value larger than that of .

The studies included experiments for different values of ¢ (state transfer time).

Our observations confirmed our expectations: by varying ¢ the curves obtained

kept the same shape, while only moving on one of the axes with constant values.

 ¡

All figures in the next sections use ¢ .

¡





¨

In order to obtain the value of the checkpointing interval ( ) for which §

¡

¡

 ( § ) is maximal (minimal), given all other parameters as fixed, we used Math-

ematica to perform the calculations as follows. We fixed, i.e. assigned numerical

¡ ¡

 ¡

' ( '

¨ § £ £

values to all parameters of § ( ) that were considered as given ( , , ,

¡



¢ §¨

£ ¦ , , ), and asked Mathematica to maximize (minimize) the expression of

¡

¡

 ( ( § ) in . The constraints we introduced in Mathematica when maximizing

(minimizing) the given function, were related to our assumptions about no infi-

' ¢ (  £ ( ' ¢ (  £ ' nite queues building up at the two servers ( , and ), 8.1 Numerical studies 141

and the average rate of checkpointing being less than the average rate of incoming

(  ' client requests ( ). The goals of the numerical studies were:

to find out the behaviour of the minimum average response time when con- ' sidered as a function of , and how the average response time when no failures and/or no checkpointing happens, relates to it.

to find out how sensitive the average response time is in terms of variations in the load. More precisely, given an infrastructure that is optimized for a given load in terms of request arrival rates, we studied how long the maximum average response time for the service can become if the load is increased beyond the predefined load parameter of the system.

to find out the behaviour of the maximum average availability when consid-

'

ered as a function of , respectively , and the extent (especially the scale)

of the dependency. Besides, we wanted to see how a fixed , respectively '

influenced the studied behaviour.

( to find out the behaviour of the average checkpoint arrival rate ( ) that max-

imizes (minimizes) the average availability (response time), when considered

' '

as a function of , respectively , for given respectively ; when con-

(

sidering as a function of we expect to find out especially the value of

for which the desired relation that (on average) “checkpointing is done

( %

more often than failures occur” (i.e. ) does no longer hold.

(

to find out the relation between the that minimizes the average response

¡

¡

(



time ( § ) and the that minimizes the average backlog processing time

¡

¦ '

( § ), when considering them as functions of with a fixed .

'

to find out (if it exists), the average client request rate ( ) for which our

¡ ¡

 

§¨ §¨ ( assumption % (with and obtained for that maximizes the average availability, respectively minimizes the average response time) no

longer holds, while “no infinite queues at the servers” are built up; this for

' (

£ given service rates ( £ and ) and a given average failure rate ( ) and

average call replay rate( £ ¦ ). This is an undesirable point since the system

is close to continuously checkpointing in order to keep up with the failover

¡

¡



time requirement. Note that, in case of minimizing § we consider only

¡

¨ § , i.e. the average time the checkpointing request spends in the system in 142 Chapter 8. Analysis of the optimization models

an equilibrium period. We believe that even this study can offer us enough information about how the client request arrival rate influences the relation between the two studied quantities.

to find out how the two optimization criteria relate to each other. In this (

context, we looked for the relation between the value of that maximizes

¡ ¡

 ¡

'

¨ §

§ respectively minimizes , when considered as a function of and

fixing . Also, we tried to find out how large the difference was between ( the maximum average availability, and the availability when using that minimizes the average response time (note that the formula for availability stays the same when using the slightly changed model when the optimization criterion is the average response time). Further, we wanted to find out the

difference between the minimum average response time and the response ( time given by using the that maximizes the average availability.

Our most valuable insights after deeper studies on the mathematical model are as follows (see also the table below). The most interesting finding was that for similar request arrival rates, the mini- mum response time for different failure rates did not present too large differences. Besides, the values were not far from those considered when the system was not exposed to failures at all and no checkpointing was executed. When failures were assumed to occur, but no checkpointing was done, the value of the minimum re- sponse time was found to tend rapidly towards very large values. For an infrastruc- ture configured to checkpoint with a rate that would minimize the average response time for a given load, the deviation from the expected minimum response time is

larger, the larger the load gets. '

The maximum availability decreases when client request arrival rate ( ), re-

spectively failure rate ( ) grows. However, the rate of availability decrease is

' higher for a higher . Similar behaviour is observed when varying and observ- ing the changes in the rate for availability decrease. There is a value of the request arrival rate from which the value of the maximiz- ing (minimizing) checkpointing interval decreases so much that it becomes much smaller than the average time the checkpointing request spends in the system. The shape of the maximizing (minimizing) checkpoint arrival rate is indepen- dent of failure rate: there is a peak after which the minimizing checkpoint rate starts decreasing. For the maximizing checkpoint arrival rate this is valid for all failure rate studies. However, the minimizing checkpointing rate, for higher failure rates, increases as the request arrival rate grows. 8.1 Numerical studies 143

Highlights Covered in section(s) significant relations involving the optimized quantities – average re- 8.1.1 and 8.1.2

sponse time or average availability – when considered as functions of

the average client request rate 

relations involving the maximizing (minimizing) checkpoint arrival 8.1.3



rate ¦ considered as function of the request arrival rate

behaviours of the maximizing (minimizing) checkpoint arrival rate ¦ 8.1.5 and 8.1.6 as function of the failure rate ¡ ; to reinforce our findings here, this part also contains studies of the maximized, respectively minimized quantities as functions of ¡

relations involving the maximizing (minimizing) checkpointing in- 8.1.4

terval as function of the request arrival rate 

trade-off availability (response time) by e.g. relating the maximizing 8.1.7

and minimizing ¦

Table 8.1: Main findings of numerical studies

When studying the maximizing (minimizing) checkpoint arrival rate as function of the failure rate, our main insight was that sometimes there is a value of the failure rate from which on the average more than one failure occurs between two checkpoint arrivals. This value is larger, the larger the request arrival rate gets. The findings are divided in five main parts covered in the next subsections and summarized in Table 8.1.

8.1.1 Relating response time to request arrival rate

Four studies were conducted about the average response time as a function of aver- ' age client request arrival rate : (1) study of the minimal average response time, (2) study of the average response time when no failures occur (and therefore no checkpointing is needed), (3) study of the average response time when failures occur, but no checkpoint- ing is done, (4) study of the sensitivity of the average response time to load variations.

Valuable insights were gained after the studies. First, some unexpected results ' were pointed out regarding the average response time (as function of ) as the system is exposed to different failure rates. Second, we found that when the system is configured to obtain a minimum average response time, the sensitivity of the 144 Chapter 8. Analysis of the optimization models

λ =0 10 f min. E[R ] with λ =0.00001 T f min. E[R ] with λ =0.001 T f 8 min. E[R ] with λ =0.01 T f min. E[R ] with λ =0.09 T f 6

4 average response time 2

1 2 3 4 client request arrival rate λ

1

¡

¡

'

 Figure 8.1: Minimal § (y axis) plotted against (x axis) for several failure rates, and for no failures average response time to load variations is not highly dependent on the load for which the system was configured.

The graph in Figure 8.1 shows the minimal average response time as function

'

of for four different values of the failure rate . The fifth curve on the graph ' represents the dependence of the average response time on when no failures occur (and thus no checkpointing is executed either). One would expect that the curve for the average response time when no failures occur is much lower than the other four curves. We can see that this curve indeed never goes above the others,

but is nevertheless very close to them. The unexpected phenomena depicted in the ' graph is that the curves are so close to each other for most values of , i.e. the minimal average response times for very different average failure rates are almost

identical. Also, we can notice that the value of the average response time is almost

'

constant until reaches a relatively high value ( ), quite close to the limit

(  £ £ .

The graph in Figure 8.2(a) presents the case when the system is configured to



'  '  ¢ £

give the minimum average response time for a given load ( and ( respectively). Thus, the system is configured to checkpoint with a rate that

minimizes the average response time for the respective loads at a given failure



 ¡ ¢ ¡ rate ( ). We see that as the load is increased the average response time does not vary that much (almost constant) for both configurations until the load reaches a limit. We also see that the two configurations are behaving similarly below their respective limits. After the load reaches this limit (for one configuration 8.1 Numerical studies 145

λ =0.01 λ =2.25 f 1 λ =1 λ =0.01 8 1 f λ =2.25 6 λ =0.00001 1 f 7 5 6

5 4

4 3 3 2 λ

average response time expected minimum for =2.25 average response time 2 1

1 1

1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ 1 1

(a) Average response time for differ- (b) Average response time for different

ent optimal checkpointing intervals and failure rates and optimal ¡ configured as-

¥

¡ ¢¡ £ £ ¤ ¢ ¤ ¥ £ suming

Figure 8.2: Load sensitivity of the average response time when system configured for minimal response time around four times the assumed load, and for the second configuration twice the

assumed load) average response time diverges. This is due to increase in failover

¡



#

§£ time, which is now so large that the value of ¡ ¡ (involving a logarithm) cannot be computed, due to the expression in the logarithm becoming negative. A similar study was performed, this time varying failure rates. Figure 8.2(b)

shows the two curves corresponding to different failure rates. The picture shows,



 ¡ ¢ ¡ ¡ ¡ ¡ somewhat expected, that if the failure rate is low ( ), it takes a large variation in the load to reach the worst-case average response time (more than

twice the assumed load), compared with the case where the failure rate is high,



 ¡ ¢ ¡ ' , where the worst-case average response time is reached earlier. One would think that something like the study of the 99th percentile response time would give an insight on what are the worst-case response time behaviours of a system, given the (minimal) average response time. These figures can easily be

approximated (by an upper bound), by using Markov’s inequality [3]

¦

©  

¦ § ¦  § ¨ ¦ § ¦   £ ¡ ¢ § §  ¡ ¢ ¡ 

(in $ , if we set , we find that – the

 ¡

¡ § 99th percentile of the random variable – is upper-bounded by ¡¡ ). There- fore we believe that they are too pessimistic to be useful in analysis of distributed 146 Chapter 8. Analysis of the optimization models

λ =0.01 λ =0.001 f f min. E[R ] min. E[R ] 2500 T T E[R ] no checkp. E[R ] no checkp. T T 2000 15000

1500 10000

1000

average response time 5000 average response time 500

0 0 1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ

1 1

¡ ¢¡ £ £ ¡ ¢ ¡ ¡ £

(a) £ (b)

¡ ¡

¡ ¡

 §

Figure 8.3: Minimal § and with no checkpointing (y axis) plotted against ' (x axis) service-oriented architectures. Figure 8.3 shows the behaviour of the average response time when failures occur, but no checkpointing is executed. In this situation, the call log size is not reduced from time to time. In fact, the size of the log can grow unbounded. As the graph illustrates, the average response time grows to infinity as the average client request arrival rate grows. Figures 8.3(a) and 8.3(b) show that the extent of growth is larger, the lower the failure rate gets.

8.1.2 Relating availability to request arrival rate ' Next, we study maximum average availability in relation to for different failure rates. By availability % (y axis in Figure 8.4(a) and in Figure 8.4(b)) we mean the

normalized value of the maximum average availability between two consecutive

¡



¨

failures (maximum § ). The normalization is done by multiplying the absolute 

value of the average maximum availability by the failure rate times ¡¡ . This roughly reflects the total time that the server is up over an extended time period,

and helps in comparing different failure rates. An example: assume that, with

  ¡



§ §¨

failure rate= ¡ ¢ ¡ one obtains the value for maximum , while with failure

 ¡



¤ § §¨ rate= ¡ ¢ ¡ one obtains the value for maximum . In both cases all other 8.1 Numerical studies 147

90 99.5 80

70 99

60 98.5 availability % 50 availability %

40 98 λ =0.01 λ =0.00001 f f 30 λ =0.09 λ =0.001 f f 97.5 1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ

1 1

£ ¡ ¢¡ £ £ ¡ ¢¡ ¡ £ ¡ ¢¡ ¡ ¡ ¡ £ (a) , ¡ ¢¡ (b) ,

Figure 8.4: Maximal average availability % (y axis) plotted against average client '

request rate (x axis)

 

§ %

parameters are the same. At first sight ¤ , but by looking closer, one sees

     

' '

 ¡ ¢  ¡

¢  ¡¡ ¤ §

that is out of , while is out of ' .

Figures 8.4(a) and 8.4(b) show that the behaviour of maximum average avail-

' '

ability in relation to is similar for all chosen failure rates: it decreases, as

 

 ¡ ¢ ¡ ¡  ¡ ¢ ¡ ¡ ¡ ¡

grows. However, for and the decrease happens with a much smaller slope than for the larger s.

8.1.3 Relating checkpoint arrival rate to request arrival rate (

This section will present the checkpointing rate that maximizes the average

¡



' (

¨

availability ( § ) as a function of . Also, it will discuss the that minimizes

¡ ¡

¡

 §¦

the average response time ( § ) and the average backlog processing time ( ), '

as functions of . The latter study is interesting as the average backlog processing

¡ ¡

¡

¦ §

time ( § ) forms part of the expression of . One may ask why is it interesting ' to study these quantities as functions of ? The answer is that we wanted to find out how increasing average client request arrival rates are handled by the system in presence of checkpointing requests and failures, and how the average checkpoint rate should behave to obtain maximal availability and minimal response time, re-

spectively. The next question is how the behaviour of these quantities relate to a chosen (failure pattern)? Would we get similar curves no matter what/how small 148 Chapter 8. Analysis of the optimization models

λ =0.00001 λ =0.00001 0.23 f f λ =0.001 0.3 λ =0.001 f f 0.2 λ =0.01 λ =0.01 f f λ =0.09 0.25 λ =0.09 2 0.17 f f 2 λ 0.15 λ 0.2 0.12 0.1 0.15 maximizing minimizing 0.07 0.1 0.05 0.05 0.02 0 1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ

1 1

¡ § ¡ ¡ ¢ §

¡ ¤ ¥

(a) that maximizes ¤ ¥ for various (b) that minimizes for various

( ' Figure 8.5: (y axis) plotted against (x axis) is the average failure rate?

We discovered two interesting phenomena: the optimizing average checkpoint ( arrival rate presents similar behaviour independent of the optimization criteria

and of failure rates. (

The graph in Figure 8.5(a) shows the behaviour of the value of that max- '

imizes the average availability, when considered as function of . Independent (

of the failure rate, the value of grows until a point where it starts falling. The (

variation and the magnitude of is different from one failure rate to another. For



 ¡ ¢ ¡ ¡ ¡ £ ¡ ¢ ¡ £

example, when , the variation is approximately between ; for

 ¡ ¢ ¡ § ¡ £ ¡ ¢

the variation is between . (

The graph in Figure 8.5(b) presents the variation of the value of that mini- '

mizes the average response time, when considered as a function of . We studied

the influence of the value of on the shape of the obtained curves. We found

( (

that the behaviour of is approximately independent of failure rate. Like for

' ¢ £ £ (

that maximizes availability, until a certain value of ( ), the value of (

increases and then it starts decreasing . The variation and the magnitude of is



 ¡ ¢ ¡ ¡ ¡ ¡

different from one failure rate to another. For , the variation is be-

 

£ ¡ ¢ ¡ £  ¡ ¢ ¡ ¡ ¢ ¡ £ £ ¡ ¢ £

tween ¡ ; for the variation is between . For quite



 ¡ ¢ ¡  ¡ ¢ ¡ § ( high failure rates ( , ) the value of is almost constantly in- creasing. Thus the magnitude and the variation are both larger. As shown by the graphs in figures 8.5(a) and 8.5(b) there are two situations

8.1 Numerical studies 149 (

when the lowest values for are obtained:

' when the average arrival rate of client requests ( ) is low, i.e. there are low service demands from the client; in such a case, checkpointing must not be done so often, since at failover, the average number of calls to replay will not

be very large

' when the average client request arrival rate ( ) is high, i.e. client requests arrive more densely; in such a case, checkpointing must not be done of- ten, because it interferes too much with incoming calls. In other words, the checkpointing request has to wait, on the average, a lot in the system before

the state is recorded. ( Next we study the value of that minimizes the average response time com- pared to the one that minimizes the average backlog processing time. We would expect the two values to be close to each other, as the reduction of the average backlog processing time would imply lowering the average response time as well.

Figure 8.6 presents, however, a different picture (typical for all failure rates stud- (

ied): independent of the failure rate the value of that minimizes the average

¡

¦

backlog processing time ( § ) is significantly larger than the value that minimizes

¡

¡

§ ( the average response time ( ). We can notice however, that the behaviour of

that minimizes ¦ is also increasing-decreasing, plotting approximately the function

¡ ¡

¤ ¤

£ £ ¥ ( ( .

This shows that to minimize the average backlog processing time, the system

¢ ¡

¢ 

would want to do as much checkpointing as allowed by the constraints

¢ ¢ ¡

©

£ and . The explanation for this tendency is that the average backlog processing time is actually only determined by the average failover time, that is in turn reduced by frequent checkpointing.

8.1.4 Relating checkpointing interval to request arrival rate

This section presents our findings about the behaviour of the maximizing (mini-

 '

mizing) checkpointing interval ( ) as a function of in relation with the average

¡

¨

time the checkpoint request spends in the system ( § ). The main insight we '

gained was that there (sometimes) exists a value of from which the inequality

¡



' §¨ % no longer holds. (the turning point) for which this happens, is the

maximal average rate of incoming client requests, which the two servers with the

¡



§¨ given average service rates can handle. In practice, the relation  means 150 Chapter 8. Analysis of the optimization models

λ =0.001 f 1.2 λ min. E[R ] 2 T λ min. E[B] 2 1

0.8 2 λ 0.6

0.4

0.2

1 2 3 4 client request arrival rate λ

1

¡ ¡

¡

( § §¦ '

Figure 8.6: for minimal and minimal (y axis) plotted against (x



 ¡ ¢ ¡ ¡ axis) for that, on the average, the checkpointing request stays in the system (the queues of the two servers and the servers themselves) more than the time that passes from

when the state gets stored until the next checkpointing request arrives. Note that

¡



§¨ the “no infinite queue” assumption can still hold, even when  . Interestingly enough, we found that the relation between the two quantities is

similar, independent of the optimization criteria we use to obtain  . Moreover, after

£ ¦ studying the relation for different failure rates and different call replay rates we discovered even how these parameters influence the exact position of the turning point. In what follows we will show a set of curves corresponding to the case where the optimization criterion was the average availability. The same conclusions hold for the studies of the response time as well.

Figure 8.7 shows the relation between the maximizing checkpointing interval 

' and the average client request rate given a fixed failure rate . The effect of the

variations in the average call replay time are shown by changing £ ¦ (average call replay rate) and keeping everything else constant, see parts (a) and (b) of Figure 8.7. We can conclude, somewhat expected, that as the call replay rate decreases the turning point moves to the left, with the possibility of disappearing when the call replay rate grows.

The effect in the decrease in failure rate on the behaviour of the two quantities

¡



¨ – the maximizing and § – was studied by varying the failure rate and keeping everything else constant. The result of this study can be seen by comparing part 8.1 Numerical studies 151

λ =0.09 λ =0.09 f f 70 maximizing p maximizing p E[T ] 25 E[T ] 60 C C

50 20

40 15

30 10 20 λ ≈ 4.4 λ ≈ 1 4 5 1 10

1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ

1 1

£ £ ¤ £ ¡

(a) (b)

¡



'

§¨

Figure 8.7: Maximizing and (y axis) plotted against (x axis) when

 ¡ ¢ ¡ §

(a) of figures 8.7 and 8.8 and also part (b) of the same figures. The graphs lead us to the result that when the failure rate decreases, the turning point is moved to the right.

The interpretation of the decrease of optimizing  as seen in the diagrams is ' as follows: when goes over a certain value, i.e the average rate of incoming requests is higher than that value, checkpointing must happen very often, to re- duce the failover time. If checkpointing were not done so often, then, at failover,

there would be more calls in the log to replay. However, depending on the value

'

of the call replay rate £ ¦ , the request arrival rate that can be handled by the

¡



§¨

system while still preserving the relation % varies. In the studies illustrated

 '

above the value of at the point where the relation between optimizing and av-

¡

¨ erage checkpoint time § changes, is lower in 8.7(b)(8.8(b)) compared to 8.7(a)

(8.8(a)). The reason is that in the first case, the average time to replay a request is

 £

' '

¢

 ¡ ¢ ¡ ¤  ¡ ¢ £

longer ( time units in 8.7(b) as compared with '( time units in 8.7(a)). 152 Chapter 8. Analysis of the optimization models

λ =0.01 λ =0.01 f f 50 maximizing p 30 maximizing p E[T ] E[T ] C C 40 25

20 30 15 20 λ ≈ 4.5 λ ≈ 4.6 10 1 1 10 5

1 2 3 4 1 2 3 4 client request arrival rate λ client request arrival rate λ

1 1

£ £ ¤ £ ¡

(a) (b)

¡



¨ '

Figure 8.8: Maximizing and § (y axis) plotted against (x axis) when



 ¡ ¢ ¡

8.1.5 Availability maximization and failure rate

Next we try to answer the question: from which failure rate (value of ) the as-

( ( % sumption about the maximizing and ( ) ceases to hold? Our studies

in this section shed light on this question and at the same time give new insights

'

about the relationship between (average client request arrival rate), and the (

maximizing .



' We choose two arbitrary values for ( and ) and study the variation with the failure rate in three contexts:

trade-off between checkpoint arrival rate and failure rate (Figure 8.9)

trade-off between number of calls to replay and failure rate (Figure 8.10(a))

trade-off between maximum availability and failure rate (Figure 8.10(b)) ' The rationale behind choosing the two values for was that we wanted to have

one relatively small value as compared with the average service rate of server 2



£ ( compared to £ ) and one relatively high ( compared to ). 8.1 Numerical studies 153

λ =1 λ =4 1 1 λ max. E[T ] λ max. E[T ] 2 A 2 A 0.8 first bisector 0.8 first bisector 0.7 0.7 0.6 0.6

0.5 0.5 λ ≈ 0.25 f 0.4 0.4 λ ≈ 0.09 0.3 f 0.3 0.2 0.2 0.1 0.1 0 0 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 failure rate λ failure rate λ

f f

¥ ¥

£ £

(a) £ (b)

Figure 8.9: Average checkpoint arrival rate (y axis) plotted against (x axis)

Finally, we consider the effect on the maximum availability when varying

(Figure 8.10(b)) that the descriptions below show is closely related to the maximiz-



( '  ' 

ing . Our observations, for followed by are given below:



' 

when (smaller average client arrival rate) we note that as soon as

% ¡ ¢ ¡ § ( % failures become more frequent, , the desired relation does no longer hold (see Figure 8.9(a)). This means that, on average, more than

one failure can occur between two checkpoints. Further, for high failure rates

' (

(not far from ), the value of becomes very small, tending toward zero. This means that we could perhaps avoid checkpointing altogether in order

to obtain maximum availability. Since client request arrivals are relatively



' 

sparse ( , as compared e.g. with a service rate on the Application

(  £ ( server, £ ), the tendency towards smaller in Figure 8.9(a) is not to avoid interference by reducing checkpointing. Rather, it shows that failover time does not grow as much, even if checkpointing is done very seldom. To

clearly illustrate this, Figure 8.10(a) shows the relation between the failure

rate ( ) and the average number of calls that are found in the log (to be replayed) at failover time. It can be seen that as failures come more often, this

number decreases (not due to checkpointing, but rather due to the low rate



'  of incoming calls!). Figure 8.10(b) shows that for , the normalized 154 Chapter 8. Analysis of the optimization models

λ =1 1 λ =4 90 25 1

20 80

70 15

availability % 60 9 50

ave. number of calls to replay λ =1 1 λ =4 40 1 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 failure rate λ failure rate λ f f

(a) Average number of calls to replay (b) Max. average availability %

Figure 8.10: Average number of calls to replay and maximum average availability

% (y axis) plotted against (x axis)

availability of the system decreases with higher failure rates, but it decreases at a slow rate. Thus, the overhead added by checkpointing more often might

be higher than the cost of replaying more calls.

'  £ (  £

when (a value closer to ) we note that the availability of the

¡

system tends towards very small values (  ) at higher failure rates (see

 (

Figure 8.10(b)). Besides, the desired relation starts not holding

% ¡ ¢ £ much later here (when ) as shown in Figure 8.9(b). Moreover, the

optimal average checkpoint arrival rate starts decreasing towards ¡ at much



'  higher failure rates (compared to when ), meaning that checkpointing is always necessary. Here, due to the higher rate of incoming calls, with no checkpointing the failover time would be much larger. Figure 8.10(a) (the upper curve) shows that at highest failure rates the number of calls to replay

does not drop below § (which is roughly ten times as high as the case with



' 

), while, as Figure 8.10(b) shows, the availability decreases rapidly



towards values ¡ . (

The next section will answer similar questions about the relation between and in the context of minimizing the average response time. Surprisingly, it will be noticed that the corresponding graphs look similar. 8.1 Numerical studies 155

8.1.6 Response time minimization and failure rate (

Our studies about the behaviour of the minimizing offered insights about the

 (

value of from which the relation no longer holds. To reinforce our

¡

¡



findings we also studied the minimal § as function of (Figure 8.12). ( Figure 8.11 shows how the value of behaves as the frequency of failures

grows, and also how the way it varies is influenced by the average client request



' ' arrival rate . We considered two values for : and .

A summary of our insights will be given below:

(

the value of from which the minimizing checkpoint arrival rate be- ' comes smaller than the average failure arrival rate, grows as increases.

This can be seen in Figure 8.11 parts (a) and (b).

'  £ (  £

In a situation like in Figure 8.11(b) when (close to the limit )

 ( it is always the case that . This latter situation is what we would like

to have.

(

for low values of the client request arrival rate, the minimizing tends to-



'  wards zero, as increases (see Figure 8.11 for ). The explanation could be that as the average client request arrival rate is rather low here, the failover time does not grow so much, since the number of requests to replay is low. Somewhat unexpected, the average response time increases for lower failure rates, and decreases as the failure rate grows (see Figure 8.12). The latter is most likely due to the fact that as the failure rate grows, the aver-

age distance between the failure occurrence moment and the moment of the

¡

© nearest state recording, i.e. § shrinks and thus the average failover time decreases.

8.1.7 Average availability-average response time trade-offs

In the last part of the numerical studies we will describe our findings when in-

¡



( §¨

vestigating how, for example the value of that maximizes relates to the

¡

¡

(

§ value of that minimizes . Further, results of the main comparison of the two optimizations will be revealed: how does the average availability of the system when minimizing average response time relate to the maximum average availabil- ity. Of course, we expected that the optimized maximum availability would be higher than the availability obtained from optimized response time. What we did 156 Chapter 8. Analysis of the optimization models

λ =1 λ =4 1 1 λ min. E[R ] 0.3 λ min. E[R ] 0.4 2 T 2 T first bisector first bisector 0.35 0.25 0.3 0.2 0.25 λ ≈ 0.12 0.15 0.2 f 0.15 0.1 0.1 0.05 0.05

0 0.1 0.2 0.3 0.4 0.02 0.04 0.06 0.08 failure rate λ failure rate λ

f f

¥ ¥

£ £

(a) £ (b)

( Figure 8.11: Minimizing (y axis) plotted against (x axis)

λ =1 1 0.306 0.304 0.302 0.3 0.298 0.296 0.294

0.292average response time 0.29

0.1 0.2 0.3 0.4 failure rate λ

f

¡ 

¡

§ '  Figure 8.12: Minimal (y axis) plotted against (x axis) for

not know before the studies was how large the difference between the two values was. Analogously, insights will be presented about the relation between the opti- mized average response time and the average response time obtained when max-

imizing availability. All studies were conducted when considering the mentioned ' quantities as functions of . 8.1 Numerical studies 157

Figure 8.13: The λ_2 that minimizes E[R_T] and the λ_2 that maximizes E[T_A] (y axis) plotted against λ_1 (x axis); (a) λ_f = 0.00001, (b) λ_f = 0.01.

Figure 8.13 presents comparatively the variations of the two checkpoint rates λ_2 (the one minimizing E[R_T] and the one maximizing E[T_A]) as functions of λ_1, for different (low and high) values of λ_f. We can see that the tendency is similar irrespective of whether the system is exposed to low or high failure rates: to obtain minimum response time, the rate of checkpointing has to be higher than to obtain maximum availability, given identical external conditions (same λ_1 and same λ_f).

To further find out how significant the difference in the λ_2s is, we examined the quantities E[T_A] and E[R_T] as follows:

1. Availability: comparing the maximal E[T_A] and the E[T_A] obtained for the λ_2 that minimizes E[R_T], and

2. Response time: comparing the minimal E[R_T] and the E[R_T] obtained for the λ_2 that maximizes E[T_A].

Interestingly enough, we found that the two values from 1, respectively from 2, are not far from each other.
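Stated compactly, the two comparisons amount to evaluating the following differences (our shorthand for items 1 and 2 above):

\[
\Delta_A = \max_{\lambda_2} E[T_A](\lambda_2) - E[T_A](\lambda_2^{R}),
\qquad \lambda_2^{R} = \arg\min_{\lambda_2} E[R_T](\lambda_2),
\]
\[
\Delta_R = E[R_T](\lambda_2^{A}) - \min_{\lambda_2} E[R_T](\lambda_2),
\qquad \lambda_2^{A} = \arg\max_{\lambda_2} E[T_A](\lambda_2).
\]

Small values of Δ_A and Δ_R mean that tuning the checkpoint rate for one criterion costs little with respect to the other, which is what the figures referenced below indicate.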

For E[T_A] this is mostly visible when the failure rate is relatively high. Figures 8.14(a) and 8.14(b) show that the value of the average system availability obtained when optimizing the average checkpoint arrival rate for minimal average response time is not much smaller than the maximum average availability.

Figure 8.14: Behaviour of the maximal E[T_A] and of E[T_A] with the λ_2 that minimizes E[R_T], for high failure rates; (a) λ_f = 0.01, (b) λ_f = 0.09.

Even for low failure rates, the difference is not large if we consider that the whole variation of E[T_A] as λ_1 goes from small values to the largest ones is itself no more than a couple of percentage points (see Figures 8.15(a) and 8.15(b)). We can therefore conclude that, especially for high failure rates, if the checkpointing interval is such that the average response time is thereby minimized, the availability of the system is also high, close to its maximum.

For E[R_T] the difference is really small both for low and for high failure rates, as shown in Figures 8.16(a) and 8.16(b). This again suggests that the average response time for a request does not get much worse when we adjust the checkpointing interval to make the system maximally available for request processing.

8.2 Simulations

This section will present the results of our simulations conducted to validate the basic mathematical model. When building the mathematical model we approximated the checkpoint interarrival time with an exponentially distributed random variable. This assumption was needed to make possible the computation of the maximizing checkpointing interval p based on the (random variable) inputs.

Figure 8.15: Behaviour of the maximal E[T_A] and of E[T_A] with the λ_2 that minimizes E[R_T], for low failure rates; (a) λ_f = 0.00001, (b) λ_f = 0.001.

As a consequence, the representation of the length of time from the end of one checkpoint to the beginning of the next one would not be deterministic. In reality, the checkpointing interval is a deterministic value p chosen by the engineer. The main goal of this section is to check how significant this approximation in the mathematical model is. That is, are all the observations in Section 8.1 valid for the real world? In particular, if an engineer takes the result of the computations (the computed maximizing p based on a set of fixed input parameters), does this indeed produce maximum availability or not? This section will show a validation of this approximation in the mathematical model. In addition, simulations are performed to validate the mathematical model itself.

8.2.1 Validation approach

Two simulation models were built within the SIMULA programming environment, each with a different purpose. SIMULA provides discrete event scheduling of the arrival of requests, servicing of the queues, etc. [160]. The first simulation model, designated to validate the mathematical model, was built using all the assumptions from the mathematical model. We ran simulations with the parameters (λ_1, λ_f, λ_2, and the service and failover related parameters) set exactly like in the numerical studies from Section 8.1.2, and studied the average availability as a function of λ_1.

Figure 8.16: Minimal E[R_T] and E[R_T] with the λ_2 that maximizes E[T_A] (y axis) plotted against λ_1 (x axis); (a) λ_f = 0.00001, (b) λ_f = 0.01.

 

For each combination of those parameter values, eight values for the average checkpoint arrival rate were considered in turn: one equal to the value that maximizes E[T_A], as obtained from the analytical model, while the other seven corresponded to values of the average checkpoint interarrival time (1/λ_2) equal to 25%, 50%, and 75%, respectively 125%, 150%, 175%, and 200% of the interarrival time that maximizes E[T_A]. These values for λ_2 were needed, on the one hand, to confirm that the simulated value is close enough to the computed value of the maximal E[T_A] for the same parameter values. On the other hand, we wanted to show that the simulated maximum availability is indeed the maximum, compared to the simulated availabilities obtained for values of 1/λ_2 larger and smaller than the maximizing one.

The second simulation model was used to validate the approximation within our mathematical model. It is based on the same assumptions as the mathematical model, except the one related to the probability distribution of the checkpointing interarrival time random variable. In the simulations each checkpointing request is scheduled exactly p units after the previous one ends, modelling a deterministic interval. We considered eight values for p for each value of λ_1: one corresponding to the maximizing value of p obtained from our mathematical analysis, while the other seven corresponded to values equal to 25%, 50%, and 75%, respectively 125%, 150%, 175%, and 200% of the maximizing value.

To check whether the simulation runs confirm our intuitive expectations concerning the approximation of the checkpoint interarrival time, we compared the following sets of simulations:

the simulated system availability (as a function of the average client request arrival rate λ_1) using a deterministic value of p equal to the computed optimal value from the Mathematica model, and

the system availability (as a function of λ_1) using a range of other values for p, both below and above the p that maximizes E[T_A].

To confirm the validity of the model we would expect that for all choices of the checkpointing interval where p was different from our maximization result, i.e. both below and above the p that maximizes E[T_A], the simulations would show lower availability compared to the availability with the maximizing p. Hence, the approximation in the mathematical model would be shown to lead to a valid result in the (simulated) reality. This provides evidence that, if our other assumptions are valid, we can expect the mathematical studies to lead to results applicable in reality.

To repeat the studies in the same framework as the mathematical analysis, we began by choosing the failure rate λ_f = 0.01 and performed a set of experiments for each value of p (respectively λ_2) - eight such values were considered - and for each selected λ_1; the values of λ_1 were those shown on the x axes of the figures. Thus, in total we ran one experiment for every combination of λ_1 and checkpoint arrival rate. Each experiment consisted of a simulation run lasting a large, fixed number of logical time units. For each experiment we computed the time the servers are not available for request processing. This is the sum of the time intervals during which the checkpointing request is processed on the Logging server (for marking the call records that have to be deleted, and for storing the state) and on the Application server, as well as the time intervals during which the system performs failover. This sum was subtracted from the total simulation logical time, essentially computing the average value of T_A. For each experiment the availability percentage was then computed by dividing this value by the length of the simulation interval. Finally, for each value of λ_2 (respectively p) and of λ_1, the average of the availability percentages was obtained.
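The availability bookkeeping described above can be illustrated with a short event-driven sketch. The code below is a deliberately simplified, single-server abstraction written only for illustration: the actual SIMULA models queue requests on separate Logging and Application servers, which is not reproduced here. It supports both an exponentially distributed checkpoint interarrival time (the assumption of the mathematical model) and a deterministic interval p (the engineered reality), and all numeric values (checkpoint cost, service rate, the "optimal" interval p_opt) are assumed for the example.

# Simplified re-creation of the availability accounting: downtime caused by
# checkpoint processing and by failover replay is subtracted from the total
# simulated logical time.
import numpy as np

def simulate_availability(lam1, lam_f, ckpt_interval, deterministic,
                          total_time=100_000.0, ckpt_cost=0.05, mu=5.0, seed=1):
    rng = np.random.default_rng(seed)
    def next_gap():
        return ckpt_interval if deterministic else rng.exponential(ckpt_interval)
    t, downtime, last_ckpt = 0.0, 0.0, 0.0
    next_ckpt = next_gap()
    next_fail = rng.exponential(1.0 / lam_f)
    while t < total_time:
        if next_ckpt <= next_fail:                     # checkpoint happens first
            t = next_ckpt
            downtime += ckpt_cost                      # state recording blocks service
            last_ckpt = t
            next_ckpt = t + ckpt_cost + next_gap()     # next interval starts afterwards
        else:                                          # failure happens first
            t = next_fail
            backlog = rng.poisson(lam1 * (t - last_ckpt))  # logged calls to replay
            failover = backlog / mu                    # backup replays the backlog
            downtime += failover
            last_ckpt = t
            next_ckpt = t + failover + next_gap()
            next_fail = t + failover + rng.exponential(1.0 / lam_f)
    return 100.0 * (1.0 - downtime / total_time)

p_opt = 2.0                                            # assumed maximizing interval
for det in (False, True):
    for factor in (0.5, 1.0, 2.0):
        a = simulate_availability(lam1=2.0, lam_f=0.01,
                                  ckpt_interval=factor * p_opt, deterministic=det)
        kind = "deterministic" if det else "exponential"
        print(f"{kind:13s}  p = {factor:.2f} x p_opt  availability = {a:6.2f}%")

Running the exponential and the deterministic variants with the same set of intervals and comparing the resulting availability percentages mirrors the validation performed with the two SIMULA models.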

Figure 8.17: Availability (y axis) plotted against λ_1 (x axis) for λ_f = 0.01: mathematically computed maximum E[T_A] versus simulated T_A with the maximizing λ_2.

Figure 8.18: Availability for different λ_2s (y axis) plotted against λ_1 (x axis), for λ_f = 0.01.

8.2.2 Validation results

This section will present the results of the analysis of the two simulation models. First, the model designated to validate the mathematical model will be analyzed. Then, we will show results concerning the validation of the approximation we made about the arrival of checkpointing requests.

The diagram in Figure 8.17 shows that for λ_f = 0.01 our mathematical model is validated, as the value of the average availability obtained in simulations is within an acceptably small distance from the value of the maximum E[T_A]. Figure 8.18 further illustrates this, as the (simulated) average availability for the maximizing λ_2 is indeed the maximum in comparison with the values of the average availability obtained for the other seven larger or smaller values of λ_2. Figures 8.19(a) and 8.19(b), as well as 8.20(a) and 8.20(b), present evidence leading to the conclusion that our approximation about the interarrival of checkpointing requests is validated as well.

The diagram in Figure 8.19(a) shows how the availability of the system varied with λ_1 when using the maximizing p (as obtained from the maximization of E[T_A] performed in Mathematica) and other values of p (obtained by decreasing the maximizing p by 25%, 50%, and 75% of its value, respectively). Note that for all values of p different from the optimal value, the availability of the system is lower than for p equal to the optimal one. Remember that decreasing p means increasing the frequency of checkpointing. Since the availability of the system decreases with lower values of p, this part of the simulation study confirms our intuitions.

Figure 8.19: Simulated average availability % (y axis) plotted against λ_1 (x axis) when λ_f = 0.01; (a) checkpointing more often than with the maximizing p, (b) checkpointing less often than with the maximizing p.

Figure 8.19(b), on the other hand, shows that when increasing p from the optimal value (by 25%, 50%, 75%, and 100% of its value, respectively) the availability of the system is not much different from the runs with the period equal to the optimal p. Moreover, some simulation runs provide availability slightly above the supposedly optimal availability. One could argue that where the sign of the difference is reversed (e.g. slightly higher availability for some of the larger values of p) the fluctuation is acceptable if the size of the difference is small.

To confirm that the sizes of the differences in availability for the chosen values of p are within reasonable limits, we plotted the same graphs (availability against λ_1 when λ_f = 0.01), but using our mathematical model. Figure 8.20(b) shows that we obtain similar increments in availability as p is reduced, when starting from values higher than the maximizing one. The values for the availability percentage in the two pairs of figures are thus quite similar.



All simulations of this section were repeated for two further values of λ_f, one lower and one higher. The validity of our approximation in the mathematical model, as well as of the mathematical model itself, was exhibited again, with results similar to the curves already presented in this section. Performing validations of the mathematical model by using the realistic telecom application provided by Ericsson AB is subject for future work.

Figure 8.20: Average availability % (y axis), computed from the mathematical model, plotted against λ_1 (x axis) when λ_f = 0.01; (a) checkpointing more often than with the maximizing p, (b) checkpointing less often than with the maximizing p.

Final remark on serialization

In Chapters 7 and 8 we presented results based on modelling the queuing of the checkpointing request in the queues of the two servers (Logging and Application), with the same priority as the regular client requests. This obviously led to a load (client request arrival rate) dependent behaviour of the time spent by the checkpointing request in the system, i.e. from entering the queue of the Logging server until exiting the system after recording the state. Also, the checkpointing rate that maximizes (minimizes) the quantity considered for optimization, viewed as a function of the load, presented low values for both high and low loads.

One could consider modelling the situation where the checkpoint request has higher priority than regular client requests, and thus studying nonpreemptive priority queues. The idea is that the checkpoint request, as soon as it arrives in the queue of e.g. server 1 (both when logging its call information and when recording the state), waits only for the currently executing request on that server, after which it immediately receives service. We constructed new formulas taking this change into account and, for the same optimization criteria, made some observations. No significant changes occurred in the behaviours of the optimizing checkpointing rate or of the optimized quantities.

Regarding the average availability as optimization criterion, we found a (somewhat expected) change in the behaviour of the maximizing average checkpoint rate. Its tendency is to grow as the client request arrival rate grows, even for high failure rates. This is explained by the fact that in this model the checkpoint time is not load dependent, and thus the only effect of growing load is to increase the average number of calls to replay; the only way to reduce this is to perform checkpointing more often. The changes to the maximum availability as a function of load or failure rate were insignificant.

Regarding the average response time as optimization criterion, there were insignificant changes in the behaviour of the minimizing checkpoint rate. On the other hand, quite significant changes were noticed in the behaviour of the minimum response time as the load grows: for high values of the client request arrival rate (relative to the service rate of the application server), the minimum response time is lower than in the case where the checkpointing request is given no priority.
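For reference, the shape of such a non-preemptive priority model is captured by the classical two-class M/G/1 result (Cobham's formula) below. This is only the textbook form, not the exact formulas constructed for our two-server model:

\[
W_0 = \tfrac{1}{2}\left(\lambda_{ckpt}\,E[S_{ckpt}^2] + \lambda_1\,E[S_{req}^2]\right),
\qquad
W_{ckpt} = \frac{W_0}{1-\rho_{ckpt}},
\qquad
W_{req} = \frac{W_0}{(1-\rho_{ckpt})(1-\rho_{ckpt}-\rho_{req})},
\]

where class "ckpt" is the high-priority checkpoint stream, class "req" the client requests, and ρ_i = λ_i E[S_i]. Since the high-priority checkpoint waits only for the residual service time of the request in progress, its time in the system no longer grows with the client queue, which is what makes the checkpoint cost effectively load independent in this variant.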

Chapter 9

Conclusions and future work

The work presented in this thesis comprises two main parts:

empirical studies with implementations of CORBA enhanced with support for fault tolerance, and

mathematical studies revealing the optimal checkpointing interval to be incorporated in a fault-tolerant middleware with support for passive replication. The latter point was studied with two different optimization criteria:

– maximum average availability
– minimum average response time

The driving force and the glue in this work was to offer an application writer easier ways to develop distributed fault-tolerant applications and to configure them for best performance.

The first step was to build a middleware that can transparently handle fault tolerance related operations, below the application level. Two infrastructures were built: one of them (called FT-CORBA) deals only with failures of the application. The other (called FA-CORBA) uniformly handles failures of the application as well as the infrastructure, hence labelled fully available. Obviously, the application writer would gain a lot by using these services, needing only to concentrate on the functionality of the application. However, he/she might have doubts about whether it is appealing to use such an infrastructure, considering factors such as performance overhead and gains in failover time. To overcome possible hesitations with respect to performance overheads in absence of failures, we conducted availability-performance trade-off studies by performing measurements on a telecom application running on top of the middlewares. We were aware of the fact that it is difficult to generalize our findings based on only one application. However, all the earlier published studies on support for fault tolerance based on CORBA had used synthetic experiments for evaluation. We believe that this work is at least a step in the right direction. This reservation is important since the performance related insights are also obviously affected by the ORB implementation over which the platforms were built. Again, we believe that the wisdom from relative comparison of the different replication styles carries over to other ORBs. A different ORB is expected to increase/decrease the overheads in all replication styles by similar (relative) proportions.

After the performance studies, as in some cases the overhead of request roundtrip time using middleware support was large, some way of reducing it had to be studied. The solution we found implied lifting some middleware level mechanism to the application level, thereby changing the application code. Here we were careful not to make optimizations that would resemble putting FT support directly in the application. The goal was to keep a minimal (and automated) change of the application. We found that a possible technique to use was aspect-oriented programming. We adopted AOP as an elegant way to extend application code with application knowledge that affects the fault tolerance mechanisms' performance. We do not exclude the possibility of existence of other solutions, but we found it interesting to apply a new technique like AOP in such a context.

Even after offering the application writer our results of empirical studies, doubts can still arise. This is due to the sometimes complex relation between the results and different infrastructure parameters that makes it impossible to configure the system only based on such empirical results, e.g. optimizing different performance metrics. To offer help to the application developer in this direction, we considered one infrastructure parameter: the checkpointing interval. The second part of the thesis presented mathematical analysis underlying the provision of optima that lead to maximal average availability and minimal average request response time, respectively.

After this short summary and motivation, we will present our conclusions regarding all the parts.

9.1 Concluding remarks

Our conclusions can be grouped in three parts. These are given by the experience with the building and evaluation of the FT-CORBA and FA-CORBA platforms, the extension involving aspect-oriented programming, and the formalization of the checkpointing interval optimization. We will begin with remarks about the empirical studies involving the two middleware platforms.

Varying the number of replicas has generated conclusions that are likely to carry over to other middleware. The number of replicas made a significant difference only for active replication and the FA styles. For the FA-CORBA platform, the group size made a difference in both overheads and failover times. This was due to the need for executing the consensus primitive in both failure free and failover scenarios. The checkpointing interval, as well as the corresponding measure in the FA-CORBA platform, the pruning interval, influenced the two metrics. When transforming the FT-CORBA middleware to support aspect-oriented programming based extension of the application code, the number of replicas did not change the results much. This was not surprising, as the replication style reproduced in this setting was only the passive style. The checkpointing interval, on the other hand, influenced the overheads to a certain extent. Still, a clear relation between the checkpointing interval parameter and the overheads was not possible to obtain from the empirical studies.

To sum up, providing fault tolerance embedded in middleware, while keeping transparency for application writers, involves a price. This price is the extra delay experienced by a client while receiving a reply from a replicated server. The price becomes higher in an infrastructure in which failures of non-application units are handled together with those of application objects (employing a distributed algorithm). The experience with building the FT-CORBA platform shows that it is feasible to implement the FT-CORBA standard by extending an existing ORB. On the other hand, with no failure handling mechanisms other than the application object related ones, the infrastructure units have to run as processes that do not fail. Ad hoc replication might be employed to cope with the problem. Still, the rigid failure detection mechanism is inherent to this platform: timers are explicit and visible at a very high level. In contrast, the FA-CORBA platform ensures that failures of infrastructure units do not impair the availability of the application; of course, at the cost of a higher overhead. Besides the higher overheads, there is another drawback in this approach: a majority of the replicated units must be up at all times.

This requires some initial hardware investment beyond a redundant pair. Note that if the group has two replicas, none of them may fail! In the FT-CORBA platform, no such restrictions exist. At the extreme (although not really desirable) the system could run with only one replica. Thus, e.g., seven out of eight replicas may fail, and the application will still not block, as opposed to the case in FA-CORBA.

The failover comparisons show that active replication in the FT-CORBA platform is unlikely to be worthy of attention due to its high overheads. In our experiments, the extremely short failover time (around 70 ms) cannot be considered essential in a high-assurance system if the no-failure roundtrip time is around 6 seconds (after adding the overheads). Another limiting factor might be the need to ensure replica determinism. In such a competition, the FA-CORBA platform will win, since non-determinism of replicas is tolerated. In both cases, the available resources further influence the decision, since the number of replicas in the group gives the maximum number of failures tolerated.

When looking at overheads, the two passive replication styles in FT-CORBA score well: there is no need for determinism, and failover is in the range of seconds. Warm passive replication does not exhibit larger overheads than the cold passive style, but gives a shorter failover time, especially if the checkpointing interval is low. The bottom line is that the quest for higher availability lies in the likelihood of failure in the infrastructure units. As long as the cost of hardening such units is not prohibitive, the FT-CORBA platform may be considered adequate, since the overheads are smaller and the failover times have comparable values.

What provides more reasons to recommend the FT-CORBA infrastructure is the successful incorporation of aspect-oriented programming in connection with the transparent extension of application code, to improve performance. The success of using aspects should be seen in the modification of the application code with no manual intervention, leading to a gain in performance of server request processing. In some cases, as shown in our performance studies, the overhead percentage dropped to half of the overhead rate in the regular FT-CORBA implementation. As mentioned before, there is further potential for reducing the application writer's input to only writing application code (i.e. no added effort due to FT and aspect-orientation support requirements). This could be implemented if standard techniques for static analysis were used to deduce the variable usage in methods, instead of the programmer writing the component description. In the context of modifying the application code by weaving aspect code, security problems can be considered. However, in the present setting, we assume that the middleware together with the aspect generating software is provided by a trusted party. Also, the application writer who augments his/her code with the aspects generated based on the component description does not impair security more than he/she would if no external code were weaved into the application. Thus, the new usage of the (aspect-supporting) FT-CORBA middleware will presumably not impair the security of the application more than the original FT-CORBA infrastructure.

After performing the large series of empirical studies with the realistic application on top of the built middleware, we found it important to provide a formal framework to the application developer. Among the parameters in the infrastructure that could be varied to obtain best performance were the number of replicas, the checkpointing interval, and even the replication style. We chose to study ways of choosing the checkpointing interval such that some quantity is optimized, as indicated by the earlier literature. Our work in this direction is, to our knowledge, the first in a setting with a separate unit for logging calls and replies. We analyzed two mathematical models of a server replicated using the primary-backup mechanism and employing logging of client calls. The basic model is used in the computation of the checkpointing interval that maximizes the average availability of the system. A more refined model is used to find the formula of the checkpointing interval that leads to minimal average response time of requests. We used queuing theory to build the models, based on assumptions about client request arrival rates, failure rates and other system parameters.

We provided formulas for the availability, respectively the response time, as functions of the average checkpoint arrival rate (λ_2); the checkpointing interval (p) is itself a function of λ_2. Further, when inputting numerical values for the remaining system parameters, any numerical tool (e.g. Mathematica) can be used to maximize the system availability or minimize the average response time, and obtain a value for the maximizing, respectively the minimizing, λ_2. The value of the maximizing (minimizing) p is then obtained from the calculated λ_2.
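The configuration recipe in the paragraph above can be automated with any numerical optimizer. The sketch below uses SciPy instead of Mathematica; the two objective functions are placeholders standing in for the E[T_A](λ_2) and E[R_T](λ_2) expressions of Chapters 7 and 8 (with all other parameters fixed), and the mapping p ~ 1/λ_2 at the end is only an assumed stand-in for the model's actual relation between λ_2 and p.

# Two-step recipe: (1) numerically optimize over lambda_2, (2) derive p.
from scipy.optimize import minimize_scalar

def best_checkpoint_rate(objective, maximize, bounds=(1e-4, 10.0)):
    # Minimize the objective (or its negation when maximizing) over lambda_2.
    sign = -1.0 if maximize else 1.0
    res = minimize_scalar(lambda lam2: sign * objective(lam2),
                          bounds=bounds, method="bounded")
    return res.x

# Placeholder curves with a single interior optimum, for demonstration only;
# replace them with the real E[T_A] and E[R_T] expressions.
availability  = lambda lam2: -(lam2 - 0.8) ** 2   # peaks at lambda_2 = 0.8
response_time = lambda lam2: (lam2 - 1.3) ** 2    # dips at lambda_2 = 1.3

lam2_avail = best_checkpoint_rate(availability, maximize=True)
lam2_resp  = best_checkpoint_rate(response_time, maximize=False)
print("lambda_2 maximizing availability :", round(lam2_avail, 3),
      " -> p ~", round(1.0 / lam2_avail, 3))
print("lambda_2 minimizing response time:", round(lam2_resp, 3),
      " -> p ~", round(1.0 / lam2_resp, 3))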

Our observations revealed significant relations between different parameters in the replicated server system. As expected, the maximum availability decreases when the client request arrival rate (λ_1), respectively the failure rate (λ_f), grows. However, the slope of the availability decrease with λ_1 is steeper for a higher λ_f, and similar behaviour is observed when the roles of the two parameters are exchanged. Surprisingly, for low λ_1 and high failure rates, one is better off avoiding checkpointing to obtain maximum availability. On the other hand, when λ_1 is high, getting close to the application service rate, checkpointing is always necessary. This is because for high λ_1 and growing failure rates, the main impairment to availability is the time spent replaying requests during failover. Here, checkpointing has to be done in order to cope with the rapid increase of the number of call records in the log.

Similar statements are valid when the optimization criterion is the average response time. Still, surprisingly, the value of the minimal average response time is not much different for different failure rates and the same client request arrival rate. Also unexpected was the comparison of the average checkpoint rate (and thus checkpointing interval) that minimizes the average backlog processing time with the λ_2 that minimizes the response time: the difference between the two values (for similar client request arrival rate and failure rate) was found to be significant. Finally, other surprises were found when studying minimal response time versus maximal availability. We found that the difference between the minimal response time (for a given request arrival rate and failure rate) and the response time obtained when checkpointing with an average rate that leads to maximal availability (under the same external conditions) is not at all significant. Also, the availability of the system when checkpointing with a rate that minimizes the response time is not far from the maximal availability obtainable for the given client request rate and failure rate.

The whole study has illuminated and deepened our understanding of the trade-offs involved between availability and performance. Still, a lot of work remains to be done and many new directions are open for exploration.

9.2 Future work

The areas covered in this thesis are quite large and diverse and open different directions for future work. An extension of the work with the FT handling middlewares is to choose some other technology than CORBA. This change could lead to studies about the extent to which the gathered experience with CORBA and fault tolerance can be applied to other middlewares. Some steps in this direction have been taken in a new European project (DeDiSys). Besides, the base ideas of the DeDiSys project provide an opportunity to study new configurations of middleware infrastructure. Thus, whereas until now full consistency of replicas was ensured in both the FT-CORBA and FA-CORBA infrastructures, an idea is to see to what extent consistency constraints can be relaxed to provide higher availability. In this context, a special feature would be to consider network partition failures, and allow request servicing in different parts of the replicated application, at the cost of reducing consistency.

Experimenting with only one application does not lead to general knowledge. One extension of this work consists in setting up one or more different applications and conducting experiments for enriching the trade-off studies with new results. To make the FT-CORBA platform compete fairly with the fully available platform, we envisage extending that work by replicating the infrastructure units as well.

Using aspect-oriented programming was successful in combination with the FT-CORBA middleware for passive replication. We can see a continuation in this sense by applying this new and elegant technique to solve performance problems in the FA-CORBA infrastructure and for other replication styles as well.

The main contribution of the work regarding the checkpointing interval optimization is that it looks deeper at the failover time in terms of the probability distribution of the number of calls to replay. However, it considers the distribution of replay time to be independent of the call to replay. An immediate extension in this direction is to consider different probability distributions depending on the call to replay. Other extensions of the work are also possible. For example, one could consider different service time distributions (exponential or not) for the different requests served. A related extension is to consider different interarrival time distributions for different types of requests. We can also think about extending the model to include the state transfer time modelled as a random variable. Last, but not least, one can consider request arrivals as non-independent.

Some of the above extensions are tractable mathematically, some are not. For example, by changing only one of the distributions from exponential to non-exponential, the mathematical computations become almost impossible. However, considering different request types to have different arrival rates, while keeping an exponential distribution of the interarrival time, can be an extension to consider. The same is valid for considering different service rates for different request types. As the work on optimization did not consider the heavy traffic case, in the future we plan to investigate this aspect. We will base our analysis on the work by Lehoczky [93]. Also, we plan to provide simulations of the refined mathematical model to validate the model used for minimizing the average response time.

Bibliography

[1] N. H. Abel. Beweis der Unmöglichkeit Algebraische Gleichungen von Höheren Graden als dem Vierten Allgemein Aufzulösen. Journal für die Reine und Angewandte Mathematik, (1):65–84, 1826.

[2] M.K. Aguilera, W. Chen, and S. Toueg. Failure Detection and Consensus in the Crash-Recovery Model, pages 99–125. Springer-Verlag, 2000.

[3] A.O. Allen. Probability, Statistics, and Queueing Theory with Applications. Academic Press, 1978.

[4] C. Almeida and P. Verissimo. Timing Failure Detection and Real-Time Group Communication in Quasi-Synchronous Systems. In Proceedings of the 8th Euromicro Workshop on Real-Time Systems, June 1996.

[5] A. Avizienis, J.-C. Laprie, and B. Randell. Fundamental Concepts of De- pendability. Technical report, UCLA and LAAS and Newcastle University, 2001.

[6] A. A. Avizienis. The Methodology of N-Version Programming, chapter 2, pages 23–46. John Wiley and Sons Ltd., 1995.

[7] M. Avvenuti, D. Pedroni, and A. Vecchio. Core Services in a Middleware for Mobile Ad-hoc Networks. In Proceedings of the 9th IEEE Workshop on Future Trends in Distributed Computer Systems (FTDCS’03), 2003.

[8] H. Aydin, R. Melhem, and D. Mosse. Tolerating Faults While Maximizing Reward. In Proceedings of the 12th Euromicro Conference on Real-Time Systems, June 2000.

[9] O. Babaoglu. The "Engineering" of Fault-Tolerant Systems, volume 448 of Lecture Notes in Computer Science, pages 262–273. Springer Verlag, Berlin, 1990.

[10] O. Babaoglu, R. Davoli, and A. Montresor. Group Communication in Parti- tionable Distributed Systems, chapter 48-78. Springer-Verlag, 2000.

[11] D. E. Bakken. Middleware. Available at http://www.eecs.wsu.edu/ bakken/middleware.pdf, 2002.

[12] M. Banatre and P.A. Lee, editors. Hardware and Software Architectures for Fault tolerance. Springer-Verlag, 1994.

[13] C. Basile, L. Wang, Z. Kalbarczyk, and R. Iyer. Group Communication Protocols under Errors. In Proceedings of the 22nd International Symposium on Reliable Distributed Systems, 2003.

[14] P.A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[15] K. Birman, A. Schiper, and P. Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3):272–314, August 1991.

[16] K. Birman, W. Vogels, K. Guo, M. Hayden, T. Hickey, R. Friedman, S. Maffeis, R. van Renesse, and A. Vaysburd. Moving the Ensemble Groupware System to Windows NT and Wolfpack. In Proceedings of the USENIX Windows NT Workshop, August 1997.

[17] K. P. Birman. The Process Group Approach to Reliable Distributed Com- puting. Communications of the ACM, 36(12):37–53,103, December 1993.

[18] K.P. Birman. Replication and Fault Tolerance in the ISIS System. ACM SIGOPS Operating Systems Review, 19(5):79–86, 1985.

[19] C. Boyer and I. Asimov. A History of Mathematics. Wiley & Sons, 1991.

[20] T. Bracewell and P. Narasimhan. A Middleware for Dependable Distributed Real-Time Systems. In Proceedings of the Joint Systems and Software Engineering Symposium, April 2003.

[21] P.J. Burke. The Output of a Queueing System. Operations Research, 4:699–704, 1956.

[22] G. Cao and M. Singhal. Checkpointing with Mutable Checkpoints. Theoret- ical Computer Science, 290(2):1127–1148, 2003.

[23] A. Casimiro and P. Verissimo. Generic Timing Fault Tolerance Using a Timely Computing Base. In Proceedings of the International Conference on Dependable Systems and Networks, pages 27–36, June 2002.

[24] T. D. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4):685–722, July 1996.

[25] T.D. Chandra and S. Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225–267, March 1996.

[26] W. Chen, S. Toueg, and M. K. Aguilera. On the Quality of Service of Failure Detectors. IEEE Transactions on Computers, 51(1):13–32, January 2002.

[27] X. Chen and M.R. Lyu. Performance And Effectiveness Analysis of Check- pointing in Mobile Environments. In Proceedings of the 22nd International Symposium on Reliable Distributed Systems, pages 131–140, 2003.

[28] P.E. Chung, Y. Hung, S. Yajnik, D. Liang, and J Shih. DOORS: Providing Fault Tolerance for CORBA Applications. In Poster Session at IFIP Interna- tional Conference on Distributed Systems Platforms and Open Distributed Processing (Middleware’98), 1998.

[29] E.G. Coffman, L. Flatto, and A.Y. Kreinin. Scheduling Saves in Fault- Tolerant Computations. Acta Informatica, 30(5):409–423, 1993.

[30] E.G. Coffman, L. Flatto, and P.E. Wright. A Stochastic Checkpoint Opti- mization Problem. SIAM Journal on Computing, 22(3):650–659, 1993.

[31] Xerox Corporation. Aspectj. Web page: http://www.eclipse.org/aspectj.

[32] F. Costa. Performance Evaluation of the Meta-ORB Reflective Middleware Platform. In Proceedings of the 23rd IEEE International Performance, Computing and Communication Conference, pages 865–870, 2004.

[33] A. Coyler, G. Blair, and A. Rashid. Managing Complexity in Middleware. In Proceedings of the 2nd AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software, March 2003.

[34] A. Coyler and A. Clement. Large-scale AOSD for Middleware. In Proceed- ings of the International Conference on Aspect-Oriented Software Develop- ment, March 2004.

[35] F. Cristian. Reaching Agreement on Processor Group Membership in Syn- chronous Distributed Systems. Distributed Computing, 4:175–187, 1991.

[36] F. Cristian. Understanding Fault-Tolerant Distributed Systems. Communi- cations of the ACM, 34(2):56–78, 1991.

[37] F. Cristian and C. Fetzer. The Timed Asynchronous Distributed System Model. IEEE Transactions on Parallel and Distributed Systems, 10(6):642– 657, June 1999.

[38] F. Cristian and F. Schmuck. Agreeing on Processor Group Membership in Timed Asynchronous Distributed Systems. Technical report, UCSD, 1995.

[39] J. Daniel, M. Daniel, O. Modica, and C. Wood. Exolab OpenORB. webpage www.openorb.com.

[40] G. Deconinck, V. De Florio, and O. Botti. Software-Implemented Fault- Tolerance and Separate Recovery Strategies Enhance Maintainability. IEEE Transactions on Reliability, 51(2):158–165, June 2002.

[41] X. Defago, A. Schiper, and N. Sergent. Semi-Passive Replication. In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, pages 43–50, 1998.

[42] C. Delporte-Gallet, H. Fauconnier, and R. Guerraoui. A Realistic Look at Failure Detectors. In Proceedings of the International Conference on Dependable Systems and Networks, pages 345–353, June 2002.

[43] P. Dutta, S. Frølund, R. Guerraoui, and B. Pochon. An Efficient Universal Construction for Message-Passing Systems. In Proceedings of the 16th International Symposium on Distributed Computing, October 2002.

[44] G. Duzan, J. Loyall, R. Schantz, R. Shapiro, and J. Zinky. Building Adaptive Distributed Applications with Middleware and Aspects. In Proceedings of the International Conference on Aspect-Oriented Software Development (AOSD'04), March 2004.

[45] C. Dwork, N. Lynch, and L. Stockmeyer. Consensus in the Presence of Partial Synchrony. Journal of the ACM, 35(2):288–323, 1988.

[46] J.C. Laprie (Ed.). Dependability:Basic Concepts and Terminology, August 1994.

[47] E.N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Comput- ing Surveys, 34(3):375–408, September 2002.

[48] J. Fabry. Replication as an Aspect. Position paper in the ECOOP Workshop on Aspect-Oriented Programming, June 1998.

[49] P. Felber, R. Guerraoui, and A. Schiper. Replication of CORBA Objects, vol- ume 1752 of Lecture Notes in Computer Science, pages 254–276. Springer Verlag, Berlin, 2000.

[50] P. Felber and P. Narasimhan. Experiences, Strategies, and Challenges in Building Fault-Tolerant CORBA Systems. IEEE Transactions on Comput- ers, 53(5):497–511, May 2004.

[51] M. J. Fischer. A Theoretician’s View of Fault Tolerant Distributed Comput- ing, volume 448 of Lecture Notes in Computer Science, pages 1–9. Springer Verlag, Berlin, 1990.

[52] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2):374–382, 1985.

[53] M. Fleury and F. Reverbel. JBoss Extensible Server. In Proceedings of the 4th Middleware Conference, pages 344–373, June 2003.

[54] Marc Fleury. Bridging Academia and Industry through Open Source. Available at http://middleware2003.inf.puc-rio.br/downloads/JBoss.ppt, June 2003.

[55] Service Availability Forum. http://www.saforum.org/home.

[56] R. France and G. Georg. Modeling Fault Tolerant Concerns Using Aspects. Technical Report 02-102, Colorado State University, 2002.

[57] R. Friedman and E. Hadad. FTS: A High-Performance CORBA Fault-Tolerance Service. In Proceedings of the 7th International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'02), 2002.

[58] A. Gal, D. Lohmann, and O. Spinczyk. Aspect-Oriented Programming with C++ and AspectC++. Tutorial held during the AOSD 2004 Conference, March 2004.

[59] A. Gal, W. Schröder-Preikschat, and O. Spinczyk. AspectC++: Language Proposal and Prototype Implementation. Position paper in the OOPSLA Workshop on Advanced Separation of Concerns in Object-Oriented Systems, October 2001.

[60] A. Gal and O. Spinczyk. Build Management for AspectC++. Position paper in the OOPSLA Workshop on Tools for Aspect-Oriented Software Develop- ment, November 2002.

[61] A. Gal, O. Spinczyk, and W. Schröder-Preikschat. On Aspect-Orientation in Distributed Real-Time Dependable Systems. In Proceedings of the 7th International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'02), 2002.

[62] F. C. Gartner. Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments. ACM Computing Surveys, 31(1):1–26, March 1999.

[63] E. Gelenbe. On the Optimum Checkpointing Interval. Journal of the ACM, 26(2):259–270, April 1979.

[64] E. Gelenbe and D Derochette. Performance of Rollback Recovery Systems Under Intermittent Failures. Communications of the ACM, 21(6):493–499, June 1978.

[65] Statistics Glossary. http://www.stats.gla.ac.uk/steps/glossary, August 2004.

[66] S.S. Gokhale, W.E. Wong, K.S. Trivedi, and J.R. Horgan. An Analytical Approach to Architecture-Based Software Reliability Prediction. In Pro- ceedings of International Performance and Dependability Symposium, pages 13–22, 1998.

[67] J. Gray. A Comparison of the Byzantine Agreement Problem and the Trans- action Commit Problem, volume 448 of Lecture Notes in Computer Science, pages 10–17. Springer Verlag, Berlin, 1990.

[68] R. Guerraoui, M. Hurfin, A. Mostefaoui, R. Oliviera, M. Raynal, and A. Schiper. Consensus in Asynchronous Distributed Systems: A Concise Guided Tour, volume 1752 of Lecture Notes in Computer Science, pages 33–47. Springer Verlag, Berlin, 2000.

[69] R. Guerraoui and A. Schiper. Software-Based Replication for Fault Toler- ance. Computer, 30(4):68–74, April 1997.

[70] V. Hadzilacos and S. Toueg. Fault-Tolerant Broadcasts and Related Prob- lems, pages 97–147. Addison-Wesley, 1993.

[71] G. H. Hardy, J. E. Littlewood, and G. Pólya. Some Theorems Concerning Monotonic Functions. In Inequalities, chapter 3.14, pages 83–84. Cambridge University Press, 2nd edition, 1988.

[72] F.J. Hauck, U. Becker, M. Geier, E. Meier, U. Rastofer, and M. Steckermeier. AspectIX: A Middleware for Aspect-Oriented Programming. Position paper in the ECOOP workshop on Aspect-Oriented Programming, July 1998.

[73] J. L. Herrero, F. Sánchez, and M. Toro. Fault Tolerance AOP Approach. In Proceedings of the Workshop on Aspect-Oriented Programming and Separation of Concerns, 2001.

[74] Y. Huang and P. Jalote. Effect of Fault Tolerance on Response Time - Analysis of the Primary Site Approach. IEEE Transactions on Computers, 41(4):420–428, April 1992.

[75] F. Hunleth, R. Cytron, and C. Gill. Building Customizable Middleware Using Aspect Oriented Programming. In Proceedings of the OOPSLA Workshop on Advanced Separation of Concerns in Object-Oriented Systems, October 2001.

[76] IONA. EJB Level Clustering. Available at http://www.iona.com/support/docs/e2a/asp/5.1/j2ee/cluster/Intro-cluster9.html, 2004.

[77] R.K. Iyer, Z. Kalbarczyk, and M. Kalyanakrishnan. Measurement-Based Analysis of Networked System Availability, volume 1769 of Lecture Notes in Computer Science, pages 161–199. Springer-Verlag, 2000.

[78] J. Helary, M. Hurfin, A. Moustefaoui, M. Raynal, and F. Tronel. Computing Global Functions in Asynchronous Distributed Systems with Perfect Failure Detectors. IEEE Transactions on Parallel and Distributed Systems, 11(9):897–909, 2000.

[79] P. Katsaros and C. Lazos. Optimal Object State Transfer-Recovery Policies for Fault Tolerant Distributed Systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN'04), June 2004.

[80] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C.V. Lopes, J.-M. Loingtier, and J. Irwin. Aspect-Oriented Programming. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP), number 1241 in Lecture Notes in Computer Science. Springer-Verlag, June 1997.

[81] M.-O. Killijian and J.-C. Fabre. Implementing a Reflective Fault-Tolerant CORBA System. In Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems (SRDS'00), 2000.

[82] M. Kimura, K. Yasui, T. Nakagawa, and N. Ishii. Optimal Checkpointing Interval of a Communication System with Rollback Recovery. Mathematical and Computer Modelling, 38:1303–1311, 2003.

[83] A. I. Kistijantoro, G. Morgan, S. K. Shrivastava, and M. C. Little. Component Replication in Distributed Systems: a Case Study Using Enterprise Java Beans. In Proceedings of the 22nd International Symposium on Reliable Distributed Systems (SRDS'03), 2003.

[84] F. Kon, F. Costa, G. Blair, and R.H. Campbell. The Case for Reflective Middleware. Communications of the ACM, 45(6):33–38, June 2002.

[85] H. Kopetz, H. Kantz, G. Grunsteidl, P. Puschner, and J. Reisinger. Tolerating Transient Faults in MARS. In Digest of Papers of the 20th International Symposium on Fault-Tolerant Computing, 1990.

[86] C.M. Krishna and K. G. Shin. Real-Time Systems. McGraw-Hill, 1997.

[87] C.M Krishna, K.G. Shin, and Y.-H. Lee. Optimization Criteria for Check- point Placement. Communications of the ACM, 27(10):1008–1012, October 1984.

[88] V. G. Kulkarni, V.F. Nicola, and K.S. Trivedi. Effects of Checkpointing And Queueing on Program Performance. Communications on Statistics- Stochastic Models, 6(4):615–648, 1990.

[89] J.C. Laprie. Dependable Computing and Fault Tolerance: Concepts and Ter- minology. In Proceedings of the 15th IEEE Annual International Symposium on Fault-Tolerant Computing, pages 2–11, June 1985.

[90] M. Larrea, A. Fernandez, and S. Arevalo. On the Impossibility of Implementing Perpetual Failure Detectors in Partially Synchronous Systems. In Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (EUROMICRO-PDP'02), 2002.

[91] Mikel Larrea, Antonio Fernandez, and Sergio Arevalo. Optimal Implemen- tation of the Weakest Failure Detector for Solving Consensus. In Pro- ceedings of the 19th IEEE Symposium on Reliable Distributed Systems (SRDS2000), October 2000.

[92] J. E. Larsen and J. H. Spring. Globe (global object exchange) a dynami- cally fault-tolerant and dynamically scalable distributed tuplespace for het- erogeneous, loosely coupled networks. Master’s thesis, University of Copen- hagen, 1999.

[93] J. Lehoczky. Real-Time Queueing Theory. In Proceedings of the IEEE Real-Time Systems Symposium, pages 186–195, 1996.

[94] G. M. A. Lima and A. Burns. A Consensus Protocol for CAN-Based Sys- tems. In Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.

[95] Y. Ling, J. Mi, and X. Lin. A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Transactions on Computers, 50(7):699–708, July 2001.

[96] M. C. Little and S. K. Shrivastava. Integrating Group Communication with Transactions for Implementing Persistent Replicated Objects, volume 1752

of Lecture Notes in Computer Science, pages 238–253. Springer Verlag, Berlin, 2000.

[97] C. D. Locke. Fault Tolerant Application Systems; A Requirements Per- spective, volume 448 of Lecture Notes in Computer Science, pages 21–25. Springer Verlag, Berlin, 1990.

[98] C. Lopes, E. Hilsdale, J. Hugunin, Mik Kersten, and Gregor Kiczales. Il- lustrations of Crosscutting. In Proceedings of the ECOOP International Workshop on Aspects and Dimensional Computing, 2000.

[99] J. P. Loyall, D. E. Bakken, R. E. Schantz, J. A. Zinky, D.A. Karr, R. Vanegas, and K. R. Anderson. QoS Aspect Languages and Their Runtime Integration. In Lecture Notes in Computer Science, volume 1511. Springer-Verlag, 1998.

[100] A. Mahmood and E.J. McCluskey. Concurrent Error Detection Using Watch- dog Processors - A Survey. IEEE Transactions on Computers, 37(2):220– 232, February 1988.

[101] E. Marsden and J.-C. Fabre. Failure Mode Analysis of CORBA Service Im- plementations, volume 2218 of Lecture Notes in Computer Science, pages 216–231. Springer-Verlag, 2001.

[102] E. Marsden, J.-C. Fabre, and J. Arlat. Characterization Approaches for CORBA Systems by Fault Injection. In Proceedings of the Workshop on Dependable Middleware-Based Systems (WDMS'02), part of the International Conference on Dependable Systems and Networks (DSN'02), pages G-16 – G-25, June 2002.

[103] MathWorld. Taylor series. http://mathworld.wolfram.com/TaylorSeries.html, December 2004.

[104] S. Mishra, L. Fei, and G. Xing. Design, Implementation and Performance Evaluation of a CORBA Group Communication Service. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing, 1999.

[105] J. Moe and D.A. Carr. Understanding Distributed Systems via Execution Trace Data. In Proceedings of the 9th International Workshop on Program Comprehension, May 2001.

[106] A. Montresor and H. Meling. Jgroup Tutorial and Programmer’s Man- ual. Technical report, University of Bologna, Italy, 2002. Available at http://www.cs.unibo.it/projects/jgroup/papers/2000-13.pdf.

[107] L.E. Moser, P.M. Melliar-Smith, D.A. Agarwal, R.K. Budhia, and C.A. Lingley-Papadopoulos. Totem: A Fault-Tolerant Multicast Group Commu- nication System. Communications of the ACM, 39(4):54–63, April 1996.

[108] S. Nakagawa, S. Fukumoto, and N. Ishii. Optimal Checkpointing Intervals for a Double Modular Redundancy with Signatures. Computers and Mathematics with Applications, 46:1089–1094, 2003.

[109] P. Narasimhan. Trade-offs Between Real-Time and Fault-Tolerance for Mid- dleware Applications. In Proceedings of the Workshop on Foundations of Middleware Technologies, November 2002.

[110] P. Narasimhan, L.E. Moser, and P.M. Melliar-Smith. Using Interceptors to Enhance CORBA. IEEE Computer, 32(7):62–68, July 1999.

[111] P. Narasimhan, L.E. Moser, and P.M. Melliar-Smith. State Synchronization and Recovery for Strongly Consistent Replicated CORBA Objects. In Pro- ceedings of the 2001 International Conference on Dependable Systems and Networks, pages 261–270, July 2001.

[112] B. Natarajan, A. Gokhale, S. Yajnik, and D.C. Schmidt. Applying Patterns to Improve the Performance of Fault-Tolerant CORBA. In Proceedings of the 7th International Conference on High Performance Computing, Bangalore, India, December 2000.

[113] T. Nguyen and A. Ourghanlian. Dependability Assessment of Safety-Critical System Software. In Proceedings of the International Conference on De- pendable Systems and Networks (DSN’03), pages 75–79, June 2003.

[114] P. Nikander. Fault tolerance in decentralized and loosely coupled systems. In Proceedings of Ericsson Conference on Software Engineering, September 2000.

[115] N. Narasimhan, L.E. Moser, and P.M. Melliar-Smith. Transparent Consistent Replication of Java RMI Objects. In Proceedings of the International Symposium on Distributed Objects and Applications (DOA'00), 2000.

[116] OMG. Fault-Tolerant CORBA Specification, April 2000. Available at ftp://ftp.omg.org/pub/docs/ptc/00-04-04.pdf.

[117] D. Palmer. Dynamic Aspect-Oriented Programming in an Untrusted Envi- ronment. In Proceedings of the Workshop on Foundations of Middleware Technologies, November 2002.

[118] M. Pasin, M. Riveill, and T. Silva Weber. High-Available Enterprise Java Beans Using Group Communications System Support. In Proceedings of the European Research Seminar on Advances in Distributed Systems (ERSADS), May 2001.

[119] L. M. Pinho, F. Vasques, and A. Wellings. Replication Management in Reli- able Real-Time Systems. Real-Time Systems, 26(3):261–296, April 2004.

[120] T. Plagemann, V. Goebel, C. Griwodz, and P. Halvorsen. Towards Middle- ware Services for Mobile Ad-hoc Network Applications. In Proceedings of the 9th IEEE Workshop on Future Trends in Distributed Computer Systems (FTDCS’03), 2003.

[121] J.S. Plank and M. G. Thomason. The Average Availability of Uniprocessor Checkpointing Systems, Revisited. Technical report, Department of Com- puter Science University of Tennessee, August 1998.

[122] A. Polze, J. Schwarz, and M. Malek. Automatic Generation of Fault- Tolerant CORBA Services. In Proceedings of Technology of Object Oriented Languages and Systems (TOOLS), pages 205–215. IEEE Computer Society Press, August 2000.

[123] A. Popovici, A. Frei, and G. Alonso. A Proactive Middleware Platform for Mobile Computing. In Proceedings of the ACM/IFIP/USENIX International Middleware Conference, June 2003.

[124] A. Popovici, T. Gross, and G. Alonso. Dynamic Weaving for Aspect Oriented Programming. In Proceedings of the 1st International Conference on Aspect-Oriented Software Development, April 2002.

[125] D. Powell. Distributed Fault Tolerance - Lessons Learnt from Delta-4, volume 774 of Lecture Notes in Computer Science, pages 199–217. Springer Verlag, Berlin, 1994.

[126] D. Powell. Group Communication. Communications of the ACM, 39(4):50– 53, 1996.

[127] M. D. Prasad and B.D. Chaudhary. AOP Support for C#. In Proceedings of the 2nd AOSD Workshop on Aspects, Components and Patterns for Infras- tructure Software (ACP4IS), 2003.

[128] TRANSORG Project. FT-CORBA Implementations Open Source. http://www.ida.liu.se/ rtlsab/TRANSORG/downloads.html.

[129] S. Punnekkat, A. Burns, and R. Davis. Analysis of Checkpointing for Real- Time Systems. The International Journal of Time-Critical Computing Sys- tems, 20(1):83–102, January 2001.

[130] B. Randell. Software Structure for Software Fault Tolerance. IEEE Trans- actions on Software Enginering, SE-1(2):220–232, June 1975.

[131] S. Ratanotayanon and P. Narasimhan. Estimating Fault-Detection And Fail- Over Times for Nested Real-Time CORBA Applications. In Proceedings of the Conference On Parallel and Distributed Processing Techniques and Applications, volume 3, pages 1217–1223, 2003.

[132] M. Raynal. Consensus in Synchronous Systems: A Concise Guided Tour. In Proceedings of the 9th International Symposium Pacific Rim Dependable Computing, 2002.

[133] Y. Ren, P. Rubel, M. Seri, M. Cukier, W. H. Sanders, and T. Courtney. Pas- sive Replication Schemes in AQuA. In Proceedings of the 9th International Symposium Pacific Rim Dependable Computing, 2002.

[134] R. Van Renesse, K. Birman, R. Friedman, M. Hayden, and D.A. Karr. A Framework for Protocol Composition in Horus. In Proceedings of the 14th Symposium on the Priciples of Distributed Computing ACM, pages 80–89, August 1995.

[135] C. F. Reverte and P. Narasimhan. Decentralized Resource Management and Fault-Tolerance for Distributed CORBA Applications. In Proceedings of the IEEE Workshop on Object-Oriented Real-Time Dependable Systems, Octo- ber 2003. 188 BIBLIOGRAPHY

[136] A. Ricciardi, A. Schiper, and K. Birman. Understanding Partititons and the ”No Partition” Assumption. In Proceedings of the Fourth Workshop on future Trends of Distributed Computing Systems, 1993. [137] L.M.R. Sampaio, F.V. Brasiliero, W. Cirne, and J.C.A. Figueiredo. How Bad Are Worng Suspicions? Towrds Adaptive Distributed Protocols. In Proceedings of the International Conference on Dependable Systems and Networks (DSN’03), 2003. [138] R. Schantz and D. Schmidt. Middleware for Distributed Systems - Evolving the Common Structure for Network-centric Applications. In The Encyclope- dia of Software Engineering, pages 801–813. John Wiley & Sons, December 2001. [139] F. B. Schneider. Implementing Fault-Tolerant Services Using the State Ma- chine Approach:A Tutorial. ACM Computing Surveys, 22(4):299–319, 1990. [140] F. B. Schneider. What Good are Models and What Models are Good? In S. Mullender, editor, Distributed Systems, chapter 2, pages 17–27. Addison- Wesley, 2nd edition, 1993. [141] W. Schult and A.Polze. Aspect-Oriented Programming with C# and .NET. In Proceedings of the 5th IEEE International Symposium on Object-oriented Real-time Distributed Computing (ISORC’2002), pages 241–248, April 2002. [142] N. Sergent, X. Defago, and Andre Schiper. Impact of Failure Detection Mechanism on the Performance of Consensus. In Proceedings of the 8th International Symposium Pacific Rim Dependable Computing, 2001. [143] D. P. Siewiorek. Faults and Their Manifestation, volume 448 of Lecture Notes in Computer Science, pages 244–261. Springer Verlag, Berlin, 1990. [144] B. Simons, J. Lundelius, and N. Lynch. An Overview of Clock Synchro- nization, volume 448 of Lecture Notes in Computer Science, pages 84–96. Springer Verlag, Berlin, 1990. [145] O. Spinczyk, A. Gal, and W. Schr¨oder-Preikschat. Aspectc++: An aspect- oriented extension of c++. In Proceedings of the 40th International Con- ference on Technology of Object-Oriented Languages and Systems (TOOLS Pacific 2002), February 2002. BIBLIOGRAPHY 189

[146] K.-F. Ssu, B.Yao, and Fuchs W.K. An Adaptive Checkpointing Protocol to Bound Recovery Time With Message Logging. In Proceedings of the Symposium on Reliable Distributed Systems, pages 244–252, 1999.

[147] D. Szentiv´anyi. Performance and Availability Trade-offs in Fault-Tolerant Middleware. Licenciate Thesis, December 2002.

[148] A. N. Tantawi and M. Ruschitzka. Performance Analysis of Checkpointing Strategies. ACM Transactions on Computer Systems, 2(2):123–144, May 1984.

[149] G. Tel. Introduction to Distributed Algorithms. Cambridge University Press, 2000.

[150] A. Tesanovic, S. Nadjm-Tehrani, and J. Hansson. Modular Verification of Reconfigurable Components. Springer-Verlag, 2004. To appear.

[151] A. Tesanovic, D. Nystr¨om, J. Hansson, and C. Norstr¨om. Aspects and Com- ponents in Real-Time System Development: Towards Reconfigurable and Reusable Software. Journal of Embedded Computing, 1(1), February 2004.

[152] A. Tjora and A. Skavhaug. A General Mathematical Model for Run-Time Distributions in Passively Replicated Fault Tolerant Systems. In Proceedings of the 15th Euromicro Conference on Real-Time Systems, pages 295–302, 2003.

[153] N.H. Vaidya. Impact of Checkpoint Latency on Overhead Ratio of a Check- pointing Scheme. IEEE Transactions on Computers, 46(8):942–947, 1997.

[154] R. van Renesse, K. P. Birman, M. Hayden, A. Vaysburd, and D. Karr. Build- ing Adaptive Systems Using Ensemble. Software–Practice and Experience, 28(9):963–979, August 1998.

[155] R. van Renesse, K. P. Birman, and S. Maffeis. Horus: A Flexible Group Communication System. Communications of the ACM, 39(4):76–83, April 1996.

[156] P. Verissimo, A. Casimiro, and C. Fetzer. The Timely Computing Base: Timely Actions in the Presence of Uncertain Timeliness. In Proceedings of the International Conference on Dependable Systems and Networks, pages 533–542, June 2000. 190 BIBLIOGRAPHY

[157] P. Verissimo and M. Raynal. Time in Distributed System Models and Al- gorithms, volume 1752 of Lecture Notes in Computer Science, pages 1–33. Springer-Verlag, 2000.

[158] P. Verissimo and L. Rodrigues. Distributed Systems for System Architects. Kluwer Academic Publisher, 2001.

[159] Y.-M. Wang and W.-J. Lee. COMERA: COM Extensible Remoting Archi- tecture. In Proceedings of the 4th USENIX Conference on Object-Oriented Technologies and Systems (COOTS), April 1998.

[160] SIMULA Webpage. ”http://www.item.ntnu.no/fag/SIE5015/host04/index.html”, September 2004.

[161] Ch.B. Weinstock and D.P. Gluch. A Perspective on the State of Research in Fault-Tolerant Systems. Special report, Software Engineering Institute, Carnegie Mellon University, Carnegie Mellon University, Pittsburgh, Penn- sylvania 15213, June 1997.

[162] Inc. Wolfram Research. Mathematica. http://www.wolfram.com/products/mathematica/, November 2004.

[163] S. S. Yau and F. Karim. An Adaptive Middleware for Context-Sensitive Communications for Real-Time Applications in En- vironments. Real-Time Systems Journal, 26(1):29–61, January 2004.

[164] J. W. Young. A First Order Approximation to the Optimum Checkpointing Interval. Communications of the ACM, 17(9):530–531, September 1974.

[165] Z. Yue, S. Zhuowei, and W. Yun. CARRIAGE: Fault-Tolerant CORBA Sys- tem Based on Portable Interceptors. In Proceedings of the 2nd International Workshop on Autonomous Decentralizeed Systems, 2002.

[166] C. Zhang and H.-A. Jacobsen. Refactoring Middleware with Aspects. IEEE Transactions on Parallel and Distributed Systems, 14(11):1058–1073, November 2003.

[167] C. Zhang and Z. Zhang. Trading Replication for Performance and Avail- ability: an Adaptive Approach. In Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS’03), 2003. BIBLIOGRAPHY 191

[168] Y. Zhang and K. Chakrabarty. Fault Recovery Based on Checkpointing for Hard Real-Time Embedded Systems. In Proceedings of the 18th IEEE Inter- national Symposium on Defect and Fault Tolerance in VLSI Systems, 2003.

[169] W. Zhao, L.E. Moser, and P.M. Melliar-Smith. End-to-end Latency of A Fault-Tolerant CORBA Infrastructure. In Proceedings of the 5th Interna- tional Symposium on Object-Oriented Real-Time Distributed Computing, pages 189–198, 2002.

[170] H. Zhu, J. P. Lehoczky, J. P. Hansen, and R. Rajkumar. Design Trade-Offs for Networks with Soft End-to-End Timing Constraints. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications, pages 413–422, May 2004.

[171] J. Zinky and R. Shapiro. The Aspect-Oriented Interceptors’ Pattern for Crosscutting and Separation of Concerns Using Conventional Object Ori- ented Programming Languages. In Proceedings of the 2nd AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (ACP4IS), March 2003.
