Performance Studies of Fault Tolerant Middleware

Total Page:16

File Type:pdf, Size:1020Kb

Performance Studies of Fault Tolerant Middleware 3HUIRUPDQFH6WXGLHVRI )DXOW7ROHUDQW0LGGOHZDUH 'LDQD6]HQWLYiQ\L Abstract Today’s software engineering and application development trend is to take advan- tage of reusable software. Much effort is directed towards easing the task of devel- oping complex, distributed, network based applications with reusable components. To ease the task of the distributed systems’ developers, one can use middleware, i.e. a software layer between the operating system and the application, which han- dles distribution transparently. A crucial feature of distributed server applications is high availability. This implies that they must be able to continue activity even in presence of crashes. Embedding fault tolerance mechanisms in the middleware on top of which the application is running, offers the potential to reduce applica- tion code size thereby reducing developer effort. Also, outage times due to server crashes can be reduced, as failover is taken care of automatically by middleware. However, a trade-off is involved: during periods with no failures, as information has to be collected for the automatic failover, client requests are serviced with higher latency. To characterize the real benefits of middleware, this trade-off needs to be studied. Unfortunately, to this date, few trade-off studies involving middleware that supports fault tolerance with application to realistic cases have been conducted. The contributions of the thesis are twofold: (1) insights based on empirical studies and (2) a theoretical analysis of components in a middleware equipped with fault tolerance mechanisms. In connection with part (1) the thesis describes detailed implementation of two platforms based on CORBA (Common Object Request Broker Architecture) with fault tolerance capabilities: one built by following the FT-CORBA standard, where only application failures are taken care of, and a second obtained by implementing an algorithm that ensures uniform treatment of infrastructure and application fail- ures. Based on empirical studies of the availability/performance trade-off, several insights were gained, including the benefits and drawbacks of the two infrastruc- tures. The studies were performed using a realistic (telecommunication) application set up to run on top of both extended middleware platforms. Further, the thesis pro- poses a technique to improve performance in the FT-CORBA based middleware by exploiting application knowledge; to enrich application code with fault toler- ance mechanisms we use aspect-oriented programming. In connection with part (2) the thesis models elements of an FT-CORBA like architecture mathematically, in particular by using queuing theory. The model is then used to study the relation between different parameters. This provides the means to configure one middle- ware parameter, namely the checkpointing interval, leading to maximal availability or minimal response time. This work has been supported by the European Project TRANSORG in the IST initiative EUTIST-AMI, the FP6 IST project DeDiSys on Dependable Dis- tributed Systems, and by CENIIT (Center for Industrial Information Technology) at Link¨oping University. Acknowledgments First and foremost I would like to thank my advisor Simin Nadjm-Tehrani for al- ways challenging me to find new ideas. Without her this work would not have been accomplished. By letting me on my own feet in the last two years she contributed to my growing and my self-knowledge. I also want to thank my co-advisor Petru Eles for reading this thesis and giving valuable feedback. I thank John M. Noble, our collaborator from the Mathematics Department, for being a real help in the work with the checkpointing interval optimization. Without him, that work could not have been done. It was really rewarding to work with mathematics in a real-life context. I want to thank Nick Szirbik, over and over again, for proposing me the idea of doing PhD studies abroad. Since the summer of 1998 my life changed completely. My colleagues in RTSLAB offered a nice working environment with nice dis- cussions. I especially thank Anne Moe for always sorting out difficult administra- tive questions in an efficient way. It will be difficult to leave them behind. My thanks go to Johan Moe for priceless discussions around the platform im- plementations and evaluations. I thank Isabelle Ravot for implementing the FA- CORBA platform during her visit in RTSLAB in 2002. I thank Torbj¨orn Ortengren¨ from Ericsson Radio AB for providing us the tele- com application. My colleague C˘alin Curescu was the one to prepare the applica- tion for usage in our experiments. Thank you! I also want to thank Lillemor Wallgren and Britt-Inger Karlsson for helping me with the administration around the thesis defense. This work has been supported by the European Commission IST initiative, Project TRANSORG that is included in the cluster of projects EUTIST-AMI on Agents and Middleware Technologies applied in real industrial environments, by the FP6 IST project DeDiSys on Dependable Distributed Systems, and by CENIIT (Center for Industrial Information Technology) at Link¨oping University. I am endlessly grateful to my wonderful friends who, nearby or far away, al- ways think about me and await me to come home. I would not be complete without them. I dedicate this work to my mother and late father, and to my one and only Sas¸a. Diana Szentiv´anyi Contents 1 Introduction 1 1.1 Motivation and overview . 3 1.2 Problem description . 5 1.3 Contributions ............................ 6 1.4 Publications............................. 7 1.5 Thesisoutline ............................ 8 2 Terminology 11 2.1 Faults and failure models . 11 2.1.1 The notions of fault, error, and failure . 12 2.1.2 Failuremodels ....................... 13 2.1.3 Timingmodels ....................... 14 2.1.4 Unreliable failure detectors . 16 2.2 Achieving fault tolerance . 17 2.2.1 Software fault tolerance . 17 2.2.2 Replication strategies . 18 2.2.3 State and checkpointing . 19 2.3 Communication in fault-tolerant distributed systems . ....... 20 2.3.1 Broadcasts ......................... 20 2.3.2 Message ordering . 21 2.4 Consensus as a basic primitive . 21 2.5 Queuingtheory ........................... 22 2.5.1 Random variables . 23 2.5.2 Queues and related notions . 24 vii 3 Fault tolerance and middleware 27 3.1 Basic algorithms for consensus . 27 3.1.1 Consensus in synchronous systems . 28 3.1.2 Consensus in asynchronous systems . 28 3.2 Variants of unreliable failure detectors . 30 3.3 Perfect failure detectors . 31 3.4 The process group abstraction . 32 3.4.1 Groupservices ....................... 33 3.4.2 Specification and implementation of group services . 33 3.4.3 Fault-tolerant group based platforms . 34 3.5 Fault tolerance and commercial middleware . 35 3.6 Assessing performance . 37 3.7 Assessing availability . 38 4 Two fault-tolerant CORBA implementations 41 4.1 The FT-CORBA infrastructure . 42 4.1.1 Thestandard ........................ 42 4.1.2 Failure model and assumptions . 44 4.1.3 Infrastructure building blocks . 45 4.1.4 Logging and failover mechanism . 51 4.2 The FA-CORBA infrastructure . 56 4.2.1 Architecture units . 57 4.2.2 Infrastructure interactions . 58 4.2.3 FA-CORBA implementation . 60 4.3 Relatedwork ............................ 65 5 Empirical studies with a telecom application 71 5.1 Experiments with the telecom application . 71 5.1.1 The telecom application . 72 5.1.2 Experiment setup . 72 5.1.3 Measuring roundtrip time overheads . 74 5.2 Results................................ 76 5.2.1 Overheads ......................... 76 5.2.2 Failovertimes........................ 79 5.3 Lessonslearnt............................ 81 5.4 Reflections on the FT-CORBA infrastructure . 83 6 Improving performance of fault-tolerant software 85 6.1 Motivation.............................. 86 6.2 BasicAOPconcepts......................... 87 6.3 Aspects for fault tolerance mechanisms . 89 6.4 Implementation issues . 93 6.4.1 Aspects defined percflow . 95 6.4.2 Method execution related advices . 95 6.4.3 Field level synchronization advices . 97 6.5 Evaluation of the approach . 98 6.6 Discussion.............................. 100 6.7 AOPandmiddleware . 103 6.8 Summary .............................. 106 7 Computing the optimal checkpointing interval 107 7.1 Relatedwork ............................ 108 7.2 The checkpointing procedure . 112 7.3 Basicmodel............................. 115 7.4 Modelling assumptions . 116 7.5 Queuing analysis . 117 7.6 Optimal checkpointing for maximum availability . 118 7.7 Optimal checkpointing for minimum response time . 124 7.7.1 New assumptions . 125 7.7.2 Minimizing average response time . 126 7.8 Summary .............................. 136 8 Analysis of the optimization models 139 8.1 Numerical studies . 140 8.1.1 Relating response time to request arrival rate . 143 8.1.2 Relating availability to request arrival rate . 146 8.1.3 Relating checkpoint arrival rate to request arrival rate . 147 8.1.4 Relating checkpointing interval to request arrival rate . 149 8.1.5 Availability maximization and failure rate . 152 8.1.6 Response time minimization and failure rate . 155 8.1.7 Average availability-average response time trade-offs . 155 8.2 Simulations ............................. 158 8.2.1 Validation approach . 159 8.2.2 Validation results . 162 9 Conclusions and future work 167 9.1 Concluding remarks . 169 9.2 Futurework............................. 172 List of Figures 3.1 State transition diagram example . 39 4.1 Deployment of the FT-CORBA infrastructure . 46 4.2 Illustration
Recommended publications
  • Failure Detectors for Wireless Sensor-Actuator Systems
    Cleveland State University EngagedScholarship@CSU Electrical Engineering & Computer Science Electrical Engineering & Computer Science Faculty Publications Department 7-2009 Failure Detectors for Wireless Sensor-Actuator Systems Hamza A. Zia Cleveland State University Nigamanth Sridhar Clevealand State University, [email protected] Shivakumar Sastry University of Akron Follow this and additional works at: https://engagedscholarship.csuohio.edu/enece_facpub Part of the Digital Communications and Networking Commons How does access to this work benefit ou?y Let us know! Publisher's Statement NOTICE: this is the author’s version of a work that was accepted for publication in Ad Hoc Networks. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Ad Hoc Networks, 7, 5, (07-01-2009); 10.1016/j.adhoc.2008.09.003 Original Citation Zia, H. A., Sridhar, N., , & Sastry, S. (2009). Failure detectors for wireless sensor-actuator systems. Ad Hoc Networks, 7(5), 1001-1013. doi:10.1016/j.adhoc.2008.09.003 Repository Citation Zia, Hamza A.; Sridhar, Nigamanth; and Sastry, Shivakumar, "Failure Detectors for Wireless Sensor-Actuator Systems" (2009). Electrical Engineering & Computer Science Faculty Publications. 64. https://engagedscholarship.csuohio.edu/enece_facpub/64 This Article is brought to you for free and open access by the Electrical Engineering & Computer Science Department at EngagedScholarship@CSU. It has been accepted for inclusion in Electrical Engineering & Computer Science Faculty Publications by an authorized administrator of EngagedScholarship@CSU.
    [Show full text]
  • IBM Research Report Proceedings of the IBM Phd Student Symposium at ICSOC 2006
    RC24118 (W0611-165) November 28, 2006 Computer Science IBM Research Report Proceedings of the IBM PhD Student Symposium at ICSOC 2006 Ed. by: Andreas Hanemann (Leibniz Supercomputing Center, Germany) Benedikt Kratz (Tilburg University, The Netherlands) Nirmal Mukhi (IBM Research, USA) Tudor Dumitras (Carnegie Mellon University, USA) Research Division Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich Preface Service-Oriented Computing (SoC) is a dynamic new field of research, creating a paradigm shift in the way software applications are designed and delivered. SoC technologies, through the use of open middleware standards, enable collab- oration across organizational boundaries and are transforming the information- technology landscape. SoC builds on ideas and experiences from many different fields to produce the novel research needed to drive this paradigm shift. The IBM PhD Student Symposium at ICSOC provides a forum where doc- toral students conducting research in SoC can present their on-going dissertation work and receive feedback from a group of well-known experts. Each presentation is organized as a mock thesis-defense, with a committee of 4 mentors providing extensive feedback and advice for completing a successful PhD thesis. This for- mat is similar to the one adopted by the doctoral symposia associated with ICSE, OOPSLA, ECOOP, Middleware and ISWC. The closing session of the symposium is a panel discussion where the roles are reversed, and the mentors answer the students’ questions about research careers in industry and academia. The symposium agenda also contains a keynote address on writing a good PhD dissertation, delivered by Dr. Priya Narasimhan, Assistant Professor at Carnegie Mellon University and member of the ACM Doctoral Dissertation Committee.
    [Show full text]
  • Design and Architectures for Signal and Image Processing
    EURASIP Journal on Embedded Systems Design and Architectures for Signal and Image Processing Guest Editors: Markus Rupp, Dragomir Milojevic, and Guy Gogniat Design and Architectures for Signal and Image Processing EURASIP Journal on Embedded Systems Design and Architectures for Signal and Image Processing Guest Editors: Markus Rupp, Dragomir Milojevic, and Guy Gogniat Copyright © 2008 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in volume 2008 of “EURASIP Journal on Embedded Systems.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Editor-in-Chief Zoran Salcic, University of Auckland, New Zealand Associate Editors Sandro Bartolini, Italy Thomas Kaiser, Germany S. Ramesh, India Neil Bergmann, Australia Bart Kienhuis, The Netherlands Partha S. Roop, New Zealand Shuvra Bhattacharyya, USA Chong-Min Kyung, Korea Markus Rupp, Austria Ed Brinksma, The Netherlands Miriam Leeser, USA Asim Smailagic, USA Paul Caspi, France John McAllister, UK Leonel Sousa, Portugal Liang-Gee Chen, Taiwan Koji Nakano, Japan Jarmo Henrik Takala, Finland Dietmar Dietrich, Austria Antonio Nunez, Spain Jean-Pierre Talpin, France Stephen A. Edwards, USA Sri Parameswaran, Australia Jurgen¨ Teich, Germany Alain Girault, France Zebo Peng, Sweden Dongsheng Wang, China Rajesh K. Gupta, USA Marco Platzner, Germany Susumu Horiguchi, Japan Marc Pouzet, France Contents
    [Show full text]
  • Middleware-Based Database Replication: the Gaps Between Theory and Practice
    Appears in Proceedings of the ACM SIGMOD Conference, Vancouver, Canada (June 2008) Middleware-based Database Replication: The Gaps Between Theory and Practice Emmanuel Cecchet George Candea Anastasia Ailamaki EPFL EPFL & Aster Data Systems EPFL & Carnegie Mellon University Lausanne, Switzerland Lausanne, Switzerland Lausanne, Switzerland [email protected] [email protected] [email protected] ABSTRACT There exist replication “solutions” for every major DBMS, from Oracle RAC™, Streams™ and DataGuard™ to Slony-I for The need for high availability and performance in data Postgres, MySQL replication and cluster, and everything in- management systems has been fueling a long running interest in between. The naïve observer may conclude that such variety of database replication from both academia and industry. However, replication systems indicates a solved problem; the reality, academic groups often attack replication problems in isolation, however, is the exact opposite. Replication still falls short of overlooking the need for completeness in their solutions, while customer expectations, which explains the continued interest in developing new approaches, resulting in a dazzling variety of commercial teams take a holistic approach that often misses offerings. opportunities for fundamental innovation. This has created over time a gap between academic research and industrial practice. Even the “simple” cases are challenging at large scale. We deployed a replication system for a large travel ticket brokering This paper aims to characterize the gap along three axes: system at a Fortune-500 company faced with a workload where performance, availability, and administration. We build on our 95% of transactions were read-only. Still, the 5% write workload own experience developing and deploying replication systems in resulted in thousands of update requests per second, which commercial and academic settings, as well as on a large body of implied that a system using 2-phase-commit, or any other form of prior related work.
    [Show full text]
  • Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks
    Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks Timothy G. Armstrong, Zhao Zhang Daniel S. Katz, Michael Wilde, Ian T. Foster Department of Computer Science Computation Institute University of Chicago University of Chicago & Argonne National Laboratory [email protected], [email protected] [email protected], [email protected], [email protected] Abstract—In order for many-task applications to be attrac- as a worker and allocate one task per node. If tasks are tive candidates for running on high-end supercomputers, they single-threaded, each core or virtual thread can be treated must be able to benefit from the additional compute, I/O, as a worker. and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this The second feature of many-task applications is an empha- means that the application should use the HPC resource in sis on high performance. The many tasks that make up the such a way that it can reduce time to solution beyond what application effectively collaborate to produce some result, is possible otherwise. Furthermore, it is necessary to make and in many cases it is important to get the results quickly. efficient use of the computational resources, achieving high This feature motivates the development of techniques to levels of utilization. Satisfying these twin goals is not trivial, because while the efficiently run many-task applications on HPC hardware. It parallelism in many task computations can vary over time, allows people to design and develop performance-critical on many large machines the allocation policy requires that applications in a many-task style and enables the scaling worker CPUs be provisioned and also relinquished in large up of existing many-task applications to run on much larger blocks rather than individually.
    [Show full text]
  • Runtime Monitoring of Web Service Choreographies Using Streaming XML
    Runtime Monitoring of Web Service Choreographies Using Streaming XML ∗ Sylvain Hall´e Roger Villemaire University of California, Santa Barbara Universit´edu Qu´ebec `aMontr´eal Department of Computer Science C.P. 8888, Succ. Centre-ville Santa Barbara, CA 9310-65110 Montreal, Canada H3C 3P8 [email protected] [email protected] ABSTRACT web services. Since most web services exchange XML mes- A wide range of web service choreography constraints on sages, one can refer to a recorded trace as an XML “docu- the content and sequentiality of messages can be translated ment” that can be analyzed using standard XML tools, us- into Linear Temporal Logic (LTL). Although they can be ing for example the XML Query Language (XQuery) [19,28]. checked statically on abstractions of actual services, it is However, most works attempting to tap on the resources desirable that violations of these specifications be also de- available in XQuery engines operate in a post mortem fash- tected at runtime. In this paper, we show that, given a ion: an instance of a choreography must be finished be- suitable translation of LTL formulæ into XQuery expres- fore analysis can take place on a complete XML document. sions, such runtime monitoring of choreography constraints While in some cases, a post mortem analysis on recorded is possible by feeding the trace of messages to a streaming traces is appropriate, there exist situations where violations XQuery processor. The forward-only fragment of LTL is in- of a specification must be addressed as soon as they are troduced; it represents the fragment of LTL supported by discovered.
    [Show full text]
  • Advanced Digital Broadcast Holdings S.A. Annual Report
    ADVANCED DIGITAL BROADCAST HOLDINGS S.A. ANNUAL REPORT 2006 Art Director: Ireneusz Golka Contents To Our Shareholders ……………… 6 Business, Operations and Strategy … 9 Corporate Governance …………… 29 Consolidated Financial Statements … 57 Statutory Financial Statements … 101 2006 : a technology year Business Highlights Revenue growing 5% to US$ 262 million Record half-year revenue at US$ 164.3 million 6 new Set-Top Box customers Hansenet (Germany, IPTV), Island Media (US, Satellite), ITI Neovision (Poland, Satellite), Telefonica O2 Czech Republic (Czech Republic, IPTV), Telecom Project 5 (Russia, Terrestrial), Jazztel (Spain, IPTV) Shift to high-end: HD-MPEG4 products at 20% of 2006 Revenue First significant sales in Americas 6% of the Group’s full-year revenue Long-term strategic partnership with ITI Neovision of Poland encompassing full-fledge collaboration, exclusivity and financial arrangements IPTV up 41%, Satellite up 140%, SW & Services up 220% compensated weak Italian DTT market and technical delays Awarded in Europe’s 500 - Entrepreneurs for Growth ranked 173 of 500 companies selected amongst 28 countries, for having achieved 58% annual compound average growth rate and the creation of more than 190 jobs, primarily in Europe, over 2002-2005 4 o ADB HOLDINGS o ANNUAL REPORT 2006 Strengthening ADB Group : High-End Focus a technology leader IN % OF 2006 DIGITAL TV EQUIPMENT SEGMENT REVENUE HD products in Italian RAI and UK BBC HD trials n MPEG2 SD, Others 62% n MPEG4 SD 16% World’s first hybrid, single-chip, n MPEG4 HD 22% advanced video coding,
    [Show full text]
  • Web Services, CORBA and Other Middleware
    Web Services, CORBA and other Middleware © Copyright IONA Technologies 2002 Dr. Seán Baker IONA Technologies Web Services For The Integrated Enterprise, OMG Workshop, Munich Feb 2003 Overview • There a number of different types of middleware – So what does Web Services offer? © Copyright IONA Technologies 2002 IONA © Copyright 2 2 Middleware • Middleware enables integration, but there are multiple – competing – choices: –CORBA –J2EE –.NET © Copyright IONA Technologies 2002 IONA © Copyright – Various MoM & EAI proprietary middleware – Web Sevices – the new kid on the block. 3 3 There’s lots of choice • Some based on technical grounds, including: – RPC versus message passing – Java specific versus multi-language – Direct versus indirect communication – Permanent versus occasional connection © Copyright IONA Technologies 2002 IONA © Copyright – Platform versus integration middleware • Some based on personal choice 4 4 Intra-enterprise versus inter-enterprise • Most middleware has been designed for intra- enterprise • Inter-enterprise adds at least two challenges – Firewalls ( & inter-enterprise security in general) © Copyright IONA Technologies 2002 IONA © Copyright – Different middleware may be used at the two ends • As well as different operating system, languages, etc 5 5 Web Services • Aims to address both of these issues – Its protocol is layered on HTTP • So it can flow through a firewall • This “cheat” raises security and other concerns, but ones that need to be addressed in any case – It uses XML to format messages © Copyright IONA
    [Show full text]
  • The Dynamic Enterprise Bus∗
    Fourth International Conference on Autonomic and Autonomous Systems The Dynamic Enterprise Bus∗ Dag Johansen Havard˚ Johansen Dept. of Computer Science Dept. of Computer Science University of Tromsø, Norway University of Tromsø, Norway Abstract This is where autonomic computing [18] comes into play. Our goal is that information access software should In this paper we present a prototype enterprise informa- keep manual control and interventions outside the compu- tion run-time heavily inspired by the fundamental principles tational loop as much as possible. Closely related is that of autonomic computing. Through self-configuration, self- such autonomic solutions should consolidate and utilize re- optimization, and self-healing techniques, this run-time tar- sources better, with the net effect that large-scale informa- gets next-generation extreme scale information-access sys- tion access systems can be built in smaller scale. In this tems. We demonstrate these concepts in a processing cluster vein, self-configuration, self-optimization, and self-healing that allocates resources dynamically upon demand. become important. We are building the next generation run-time for fu- ture generation information access systems. Autonomic be- 1 Introduction havior is fundamental in order to dynamically adapt and reconfigure to accommodate changing situations. Hence, the run-time needs efficient mechanisms for monitoring Enterprise search systems are ripe for change. In their in- and controlling applications and their resource, moving ap- fancy, these systems were primarily used for information re- plications transparently while in execution, scheduling re- trieval purposes, with main data sources being internal cor- sources in accordance to end-to-end service-level agree- porate archives.
    [Show full text]
  • Operating System Vs. Middleware
    What is Middleware? Application Application Middleware Middleware Operating System Operating System Software that functions as a conversion or translation layer. • It is also a consolidator and integrator. – Custom-programmed middleware solutions have been developed for decades to enable one application to communicate with another that either runs on a different platform or comes from a different vendor or both. • Today, there is a diverse group of products that offer packaged middleware solutions. AP 04/07 Operating System vs. Middleware Application Middleware Operating System AP 04/07 1 The Middleware Layer Distributed Application Distributed Application Middleware API Middleware API Middleware Middleware Operating System API Operating System API Operating System Operating System (Processes, Communication, (Processes, Communication, Memory Management) Memory Management) Network AP 04/07 Purpose and Origin • Middleware is connectivity software that consists of a set of enabling services that allow multiple processes running on one or more machines to interact across a network. • Middleware is essential to migrating mainframe applications to client/server applications and to providing for communication across heterogeneous platforms. – This technology has evolved during the 1990s to provide for interoperability in support of the move to client/server architectures (see Client/Server Software Architectures). • The most widely-publicized middleware initiatives are the – Open Software Foundation's Distributed Computing Environment (DCE), – Object Management Group's Common Object Request Broker Architecture (CORBA), and – Microsoft's COM/DCOM (Component Object Model) AP 04/07 2 Technical Detail • Middleware servi ces are set s of di st rib ut ed soft ware that exist between the application and the operating system and network services on a system node in the network.
    [Show full text]
  • Low-Overhead Accrual Failure Detector
    Sensors 2012, 12, 5815-5823; doi:10.3390/s120505815 OPEN ACCESS sensors ISSN 1424-8220 www.mdpi.com/journal/sensors Article Low-Overhead Accrual Failure Detector Xiao Ren, Jian Dong *, Hongwei Liu, Yang Li and Xiaozong Yang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China * Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel.: +86-451-8640-3317; Fax: +86-451-8641-3309. Received: 6 March 2012; in revised form: 20 April 2012 / Accepted: 25 April 2012 / Published: 4 May 2012 Abstract: Failure detectors are one of the fundamental components for building a distributed system with high availability. In order to maintain the efficiency and scalability of failure detection in a complicated large-scale distributed system, accrual failure detectors that can adapt to multiple applications have been studied extensively. In this paper, an new accrual failure detector—LA-FD with low system overhead has been proposed specifically for current mobile network equipment on the Internet whose processing power, memory space and power supply are all constrained. It does not rely on the probability distribution of message transmission time, or on the maintenance of a history message window. By simple calculation, LA-FD provides adaptive failure detection service with high accuracy to multiple upper applications. The related experiments and results have also been presented. Keywords: failure detection; accrual failure detector; adaptive 1. Introduction Failure detector is one of the fundamental components for building a distributed system with high availability [1]. By providing the processes’ failure information to the system, it supports the solution of many basic issues (such as consensus and atomic broadcasting, etc.) in an asynchronous system.
    [Show full text]
  • A Literature Review of Failure Detection Within the Context of Solving the Problem of Distributed Consensus
    A literature review of failure detection Within the context of solving the problem of distributed consensus by Michael Duy-Nam Phan-Ba B.S., The University of Washington, 2010 AN ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) April 2015 c Michael Duy-Nam Phan-Ba 2015 Abstract As modern data centers grow in both size and complexity, the probabil- ity that components fail becomes significant enough to affect user-facing services [3]. These failures have the apparent consequence of invoking the impossibility result for distributed consensus in the presence of even one failure [15]. One way to solve the impossibility result is to use failure de- tectors [8]. In this essay, we present the theoretical models that allow us to solve consensus. Then, we discuss practical refinements to the models for the purposes of implementing failure detectors in practice. Finally, we con- clude by surveying common design patterns for building distributed failure detectors. ii Table of Contents Abstract ................................. ii Table of Contents ............................ iii List of Tables .............................. v List of Figures .............................. vi List of Algorithms . vii Acknowledgements . viii Dedication ................................ ix 1 Introduction ............................. 1 2 The theory behind failure detection and consensus . 3 2.1 The impossibility result for consensus . 3 2.2 A model of failure detection . 4 2.3 The weakest failure detector . 7 3 Binary failure detection with partial synchrony . 13 3.1 Pull failure detection . 14 3.2 Push failure detection . 15 3.3 Push-pull failure detection .
    [Show full text]